
The perils of diving into data lakes



Murray Callander, Eigen, UK, explains how traditional views of big data benefits in the oilfield are giving way to the rise of micro lakes, distributed twins and distributed intelligence.

Looking back over the past three or four years in the oilfield, big data and data lake projects have been regularly heralded as a key foundation of digital transformation. Companies are justifiably attracted to the idea of pooling all their data into a single source and plugging into machine learning tools to gain detailed insights.

Despite the theoretical benefits of this approach, however, it is important to remember that data is useless without context. If data is simply thrown into a vast lake without an understanding of what that data is and how it relates to the objective, it can lose much of its intrinsic value.

Data quality is paramount in the search for value from advanced tools such as machine learning and AI. If information is lost in the process of transferring the data into a data lake, it will limit the tools that can be applied and the value that can be extracted later.


Figure 1. In practice, oilfield digitalisation is being enabled by multiple interlinked micro-clouds provided by different vendors.

With this in mind, and alongside the proliferation of new services being offered by original equipment manufacturers (OEMs), the objectives behind the concept of the data lake (and, to some extent, the digital twin) are being challenged and are starting to evolve as a result of three forces:

  • Agile working and the sheer complexity of an all-encompassing data lake.
  • The need for very high frequency (and high fidelity) real time data.
  • New OEM business models including equipment as a service and value-adding monitoring services.

The sheer complexity

The world is moving quickly, and the adoption of agile workflows in many of the oil and gas majors is driving new digital initiatives to show value quickly. There is little business justification for taking the time to architect an all-encompassing data lake. As a result, only the part of the lake that is needed for the current minimum viable product (MVP) is built, with the intention that it will be expanded by others as needed. Of course, what happens in practice is that the data lake never really gets built properly and ends up being only a partial solution that works for a few use cases.

The need for high fidelity data

In order to monitor vibration in rotating equipment, thousands of values a second are required. The data storage needed is therefore huge, and the network bandwidth required to transport it is correspondingly large: data at 10 Hz needs 600 times more storage space than data sampled once a minute. It is inefficient to move all of this data into a central data lake and then transmit it all out again to an analysis application. It makes much more sense to process and interpret the data as close to the equipment as possible and only transmit the results.
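
As a rough illustration of that arithmetic, the sketch below estimates per-channel storage at the two sampling rates. The figure of 8 bytes per stored value is an assumption for illustration only; it ignores timestamps, metadata and compression.

# Rough storage arithmetic for one sensor channel over a year.
# Assumption (illustrative only): 8 bytes per stored value,
# ignoring timestamps, metadata and compression.

BYTES_PER_VALUE = 8
SECONDS_PER_YEAR = 365 * 24 * 3600

def bytes_per_year(samples_per_second: float) -> float:
    """Raw storage needed for one channel at a given sampling rate."""
    return samples_per_second * SECONDS_PER_YEAR * BYTES_PER_VALUE

one_per_minute = bytes_per_year(1 / 60)   # one value per minute
ten_hz = bytes_per_year(10)               # ten values per second

print(f"1/min sampling: {one_per_minute / 1e6:.1f} MB per channel per year")
print(f"10 Hz sampling: {ten_hz / 1e9:.1f} GB per channel per year")
print(f"Ratio: {ten_hz / one_per_minute:.0f}x")  # 600x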

New OEM business models

Vendors are now providing services that guarantee the availability of their equipment. To provide these services they must have full access to the data, and they will not accept it being transferred via a company’s data lake over which they have no control. They also need very high fidelity data. The company may have rights to some of this data, but probably not all of it, in which case its data lake will have wide gaps.

The emergence of distributed lakes

With suppliers now providing data services, companies often have multiple partial (micro) digital twins and (micro) data lakes – usually one with each of their equipment vendors. Consequently, there is still a requirement for a system to consolidate the output from these and provide an overview, but the need for a single system to act as the hub has reduced significantly.

Reference architectures for digitalisation, cloud services and AI are currently shifting from the original thinking on centralised data lakes to a more realistic distributed ‘micro services’ architecture. A similar evolution has happened in software development with the rise of micro services-based architectures, where instead of one monolithic application that does everything, the application is broken down into smaller specialist functions that can exist on their own. This makes it much faster to deploy and manage new functions. The same thing is now happening in oil and gas.

As a result, the digital twin is now giving way to the distributed twin – and the data lake is also giving way to what is being termed as the ‘distributed lake’.

Why is this change happening?

The equipment suppliers, being best placed to interpret and manage that data, are also in the best position to learn from the most recent information about how to analyse the health of their own equipment.

Suppliers have access to a broad installed base, including large quantities of training data, and are therefore better positioned to interpret any new data. Given their need for direct access to the equipment, and to the data from it, an architecture where suppliers have to navigate through a third party data lake is far less appealing – and in the case of an equipment-as-a-service contract, would be contractually untenable.

Furthermore, of the recent successful implementations of machine learning that Eigen has witnessed being undertaken by operators, none has been on individual equipment failure prediction. The reason is that it is extremely hard to get enough quality training data with access to only one or two units of a particular piece of equipment.


Figure 2. The concept of a single cloud or data lake is turning out to be impractical.

On a wider industry level, companies such as Rolls Royce have been doing this for some time. As well as supplying power drives, they now sell a service that monitors the health of those drives, connecting directly to the equipment and making the information available in the cloud.

In this example, all the data goes through the partner company’s own networks straight to the supplier’s cloud – there is no driver for it to pass through the customer’s systems at all. Big business is moving this way, and soon the idea of oil companies owning a big lake where all the data goes in, and then sharing the data from it with third parties, is also likely to decline.

Companies with multiple units of equipment from different suppliers will have multiple versions of these integrated services, with high frequency data flowing from the packaged equipment to the third party supplier and then into a portal. The suppliers will also expose APIs to those services, allowing operators to pull data back.
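
What ‘pulling data back’ over such an API might look like is sketched below. The portal address, endpoint path, equipment tag and token are all hypothetical – they do not describe any particular vendor’s interface.

# Hypothetical sketch: pulling interpreted results back from a
# supplier's monitoring portal over a REST API. The URL, endpoint
# path, tag and token below are illustrative assumptions only.
import requests

PORTAL_URL = "https://vendor-portal.example.com/api/v1"  # hypothetical
API_TOKEN = "..."  # issued by the supplier

def get_equipment_health(tag: str) -> dict:
    """Fetch the latest health summary for one equipment item."""
    response = requests.get(
        f"{PORTAL_URL}/equipment/{tag}/health",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

summary = get_equipment_health("GT-4711")  # hypothetical equipment tag
print(summary.get("status"), summary.get("last_updated"))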

The bulk of the raw data will no longer be going through oil companies’ databases, and they probably will not have a copy of it. This is analogous to traditional drilling, of course: it is a common occurrence for drilling service companies to gather the data and provide a service where they analyse what is going on and let operators view it while drilling – the service company is paid for providing access to the interpreted data.

Here, it is easy to see a situation where a variety of functioning micro clouds, or micro lakes, are dotted around. To some extent this repeats the architecture that already exists, with many different functional systems sitting offshore – except that these different functional systems are now taken up a level into the cloud.

With many smaller clouds, each holding its own portion of the data, operators do not want to have to go into every different portal to see if there are any issues. They will need a summary data feed, with alerts feeding a consolidated overview that helps them understand system status across everything. In addition, they will want the ability to drill down into any one of the various portals if they need to look at something specific.
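
A minimal sketch of such a consolidated overview is given below, assuming each vendor feed returns alerts in its own format; the field names and severity levels shown are illustrative, not a standard schema.

# Minimal sketch of a consolidated overview built from several vendor
# micro-cloud alert feeds. Field names and severity levels are
# illustrative assumptions, not a standard schema.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Alert:
    source: str          # which vendor portal raised the alert
    equipment: str       # equipment tag
    severity: str        # e.g. "info", "warning", "critical"
    message: str
    drill_down_url: str  # link back into the vendor's own portal

def consolidate(feeds: dict[str, Callable[[], Iterable[dict]]]) -> list[Alert]:
    """Pull each vendor feed and normalise it into a single alert list."""
    alerts = []
    for source, fetch in feeds.items():
        for raw in fetch():
            alerts.append(Alert(
                source=source,
                equipment=raw.get("tag", "unknown"),
                severity=raw.get("severity", "info"),
                message=raw.get("message", ""),
                drill_down_url=raw.get("url", ""),
            ))
    # Surface the most severe issues first in the overview.
    order = {"critical": 0, "warning": 1, "info": 2}
    return sorted(alerts, key=lambda a: order.get(a.severity, 3))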

That situation is a huge shift from the equivalent of a single lake containing everything as a single version of the truth. With the new distributed data requirements, players positioning themselves as the gateway to all of a company’s data will struggle to survive.

Service companies and equipment companies will require a direct connection to the equipment in order to monitor it successfully, make changes to the data stream, change the sampling frequency, or perform software updates.

Having everything go through a single operator owned data lake is unlikely to be tenable in the future.

Opportunities for operators with machine learning and AI

Successful operators are focusing on making their own internal processes as efficient and integrated as possible. Currently, they have a lot of information being generated by physical assets, but they have very little information coming off their organisational assets.

By enabling a flow of event-based organisational data, operators can start to monitor what the organisation is doing in parallel with its processes. Armed with event-based data, operators have a view across the entire workflow, making it easy to identify problem areas.
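
As an illustration of what event-based organisational data could look like in practice, the sketch below has each step of a workflow emit a timestamped event so that the time spent between steps can be measured; the field and step names are assumptions for the example, not a defined standard.

# Illustrative sketch of event-based organisational data: each step in a
# workflow emits a timestamped event, so the elapsed time between steps
# can be measured and bottlenecks identified. Field and step names are
# assumptions for the example only.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class WorkflowEvent:
    workflow_id: str    # e.g. a work order or permit reference
    step: str           # e.g. "raised", "approved", "executed", "closed"
    actor: str          # team or role that performed the step
    timestamp: datetime

def step_durations(events: list[WorkflowEvent]) -> dict[str, float]:
    """Hours elapsed between consecutive steps of one workflow."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    return {
        f"{a.step} -> {b.step}": (b.timestamp - a.timestamp).total_seconds() / 3600
        for a, b in zip(ordered, ordered[1:])
    }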

With this approach to the flow of information, operators can focus on the big picture and on optimising the whole throughput, from an organisational as well as a process perspective.

The parallels with internet use, user behaviour and social media are clear, and it is likely that the upstream industry will soon emulate these information flow practices – with the single source of data approach becoming increasingly redundant.

Read the article online at: https://www.oilfieldtechnology.com/special-reports/02072020/the-perils-of-diving-into-data-lakes/
