IT industry commentators often quote mind-boggling figures to illustrate the huge volumes of data organisations will have to store and process in future. But if you want a more accurate forecast of the challenges you’re likely to face when data gets really big, ask the Met Office.
Home to some of the world’s largest supercomputers, the organisation today holds around 60PB (petabytes) of data, actively processing around a petabyte a day. Its data archive is increasing by around 1.4PB a week and by 2020 will contain more than 300PB.
“Data management techniques that work with smaller amounts of data – even volumes that a lot of people would today call ‘big’ data – start to break down when you get into these realms,” says CIO Charles Ewen. “We’re facing challenges now that in the next five to 20 years are likely to hit the bulk of organisations. The old store-and-forward model – distributing a dataset to many places – just won’t work any more.”
At the heart of the organisation’s efforts to devise new ways of handling this ever-swelling sea of data sits the Met Office Informatics Lab. Its small team of engineers, scientists and designers works on the cutting-edge research, development and innovation that keeps the organisation improving its operations, products and services in line with its public service remit to “protect life and prosperity, enhance wellbeing and support economic growth”.
Jacob Tomlinson, the lab’s lead engineer, says weather forecasting is only the surface of what it does. “We’re involved in a huge range of things, from advising wind farms where to site their wind turbines to informing airports precisely how much de-icer to spray on a plane so that none gets wasted.”
Like all public sector organisations, the Met Office is committed to making as much of its data publicly available as possible, but the sheer volumes – and the fact that weather data goes stale very quickly – make that abnormally challenging. It has traditionally focused on pulling out subsets of data to deliver only the information that particular customers need.
“Typically, we’ve had large teams of consultants thinking about how to generate reduced datasets for specific customers,” says Tomlinson. “Now, though, the organisation is working on ways to let people explore and manipulate that data themselves.”
Data in the cloud
Making its vast datasets available via its own systems would quickly saturate the Met Office’s bandwidth, so the first task was to transfer a big chunk of data to a public cloud platform. For this, it used Amazon Web Services’ Snowball service. “They post you a massive hard drive of up to 100TB; you plug it in, fill it full of data and post it back to them to plug in at their end and transfer to the S3 cloud storage platform,” says Tomlinson.
The data that’s been placed in the cloud to date is, at around 80TB, only a fraction of the Met Office’s total store, but Tomlinson says the organisation has focused on data that will be the most useful to the greatest number of customers. “At the moment, it includes all of the model runs we’ve done globally for 2016, plus all the UK runs from 2013 to 2016. We’ll be adding to that over time, putting new data up in batches,” he says.
Unlike a lot of big datasets, weather data is multidimensional, so it does not fit into a traditional table structure. Instead, data is stored using “z-order curves”, a mathematical technique that maps multidimensional data onto one dimension while preserving the locality of data points – the same method Google Maps uses to quickly serve up the right data as you zoom in on the map.
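The core trick behind a z-order (Morton) curve is simple bit interleaving. The sketch below is purely illustrative – it is not the Met Office's implementation, just the standard technique for two grid dimensions: interleave the bits of the two indices so that points close together in 2D get codes close together in 1D.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two grid indices into one Morton code.

    Bits of x land in the even positions and bits of y in the odd
    positions, so nearby (x, y) points tend to get nearby codes.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def z_order(points):
    """Sort 2D grid points along the z-order curve."""
    return sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
```

Storing data in this order means a query for a small geographic region touches a small, mostly contiguous range of the 1D store rather than rows scattered across the whole dataset.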
Tomlinson says being fully engaged with the open source community has also led to interesting collaborations with other organisations such as the US Army Engineer Research and Development Center and Nasa. “We’ve built some really great relationships with diverse organisations exploring similarly weird large data formats. For example, we recently adapted some brain scanning software to visualise hurricane data,” he says.
Lazy data queries
With the data stored in the cloud, the Informatics Lab has been developing techniques to allow that data to be queried and manipulated quickly over the web via standard application programming interfaces (APIs).
At the front end, the system uses Jupyter Notebook, a browser-based data science environment that lets you create and share interactive documents where you can write and run live Python code, calculations, equations, visualisations and text. Sitting behind this front end are two technologies that point the way towards a leaner, smarter approach to big data management – an approach likely to be relevant to more organisations as data grows bigger and harder to handle – so-called “lazy” applications. The first of these is a Met Office-developed open source library called Iris.
“One of the techniques we’re using to deal with this massive volume of data is to be as lazy as possible,” says Tomlinson. “Iris knows about the way we structure our data, its different dimensions, how it’s packed, etc. You can load in data and perform manipulations – averages, aggregations, regridding and so on. We call it lazy because if you ask it to load 100TB of data it will immediately come back to you and say it has, but it hasn’t really. In fact what it’s done is look through the headers – the metadata – so it’s aware what data it has access to. You can continue performing transformations and Iris keeps responding instantly.”
Only at the point where a user chooses to visualise the results or generate a product does Iris retrieve the specific data it needs to do its calculations – and only that data. “So if I try to load the UK data and then immediately zoom in on London and start doing some manipulations on that, when I get to the end Iris just loads the portion of data for London that it needs to complete my query,” says Tomlinson.
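The behaviour Tomlinson describes can be sketched in a few lines. The toy class below is a hypothetical stand-in, not Iris's real API: a “load” records only metadata (the shape known from file headers), slicing just narrows the region of interest, and the backing store is read only when the data is finally realised – and only for the narrowed region. The `fetch` callable and the `realise` method name are both inventions for illustration.

```python
class LazyCube:
    """Toy sketch of Iris-style lazy loading (hypothetical API)."""

    def __init__(self, fetch, shape, region=None):
        self._fetch = fetch   # callable that actually reads data
        self._region = region if region is not None else tuple((0, n) for n in shape)
        self.shape = tuple(hi - lo for lo, hi in self._region)

    def __getitem__(self, slices):
        # Narrow the region lazily; no data is touched here.
        region = tuple(
            (lo + s.start, lo + s.stop)
            for (lo, hi), s in zip(self._region, slices)
        )
        return LazyCube(self._fetch, None, region)

    def realise(self):
        # Only now is the backing store read, and only this subset.
        return self._fetch(self._region)


reads = []  # record every actual read, to show how few happen

def fetch(region):
    reads.append(region)
    return [[(x, y) for y in range(*region[1])] for x in range(*region[0])]

cube = LazyCube(fetch, (100, 100))           # instant: metadata only
london = cube[slice(40, 50), slice(60, 70)]  # still instant: no read yet
data = london.realise()                      # exactly one read, 10x10 cells
```

Every step before `realise()` returns immediately because nothing has been read; the single entry in `reads` afterwards shows that only the London-sized subset was ever fetched.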
But what if a calculation is particularly complex or processor-intensive? To tackle that problem, the lab is employing another “lazy” open source technology, Dask – a third-party parallel computing library for analytics which it has baked into the next version of Iris. “Dask does something similar to Iris, but for computation. So, when you say you want to run this algorithm or that manipulation, it builds up a large compute graph without executing anything. Only when you try to do something with that final number does it actually perform any calculations,” says Tomlinson.
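The deferred-computation idea Tomlinson describes can be illustrated with a minimal graph builder. This is a sketch in the spirit of Dask's delayed graphs, not Dask's actual API: arithmetic on `Deferred` objects only records nodes in a graph, and no calculation runs until `compute()` is called on the final result.

```python
class Deferred:
    """Minimal sketch of a lazy compute graph (illustrative, not Dask)."""

    def __init__(self, func, *deps):
        self.func = func  # operation to apply
        self.deps = deps  # upstream graph nodes

    def __add__(self, other):
        # Record the addition as a new graph node; don't evaluate it.
        return Deferred(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Deferred(lambda a, b: a * b, self, other)

    def compute(self):
        # Only here does any arithmetic actually run, by walking
        # the graph from the leaves up to this node.
        return self.func(*(d.compute() for d in self.deps))


def value(v):
    """Wrap a plain value as a leaf node of the graph."""
    return Deferred(lambda: v)


# Building the expression is instant; nothing is evaluated yet.
total = (value(2) + value(3)) * value(10)
result = total.compute()  # the whole graph runs only now
```

Because the full graph is known before anything executes, a real system like Dask can also optimise it – pruning branches the final answer never uses, exactly as Iris skips data the final product never touches.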
Once it knows what it needs to calculate, Dask can then split that final calculation up into discrete sections and where required run them simultaneously across a large volume of virtual machines. One illustration of the tool’s power came when the Met Office conducted its initial data upload to S3 and needed to convert its proprietary files to an open format.
“We’d worked out the conversion would take about 2,000 hours, so we created a compute cluster in the cloud comprising 2,000 CPU cores. Then we told Dask to spread the task across all the machines, so instead of taking 2,000 hours, it took just an hour,” says Tomlinson.
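The pattern behind that speed-up – split an embarrassingly parallel job into chunks and run them concurrently – can be sketched with Python's standard library. The `convert` function below is a hypothetical stand-in for converting one batch of files; the real job ran on a 2,000-core Dask cluster rather than a local thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def convert(chunk):
    """Stand-in for converting one batch of proprietary records."""
    return [record.upper() for record in chunk]


def convert_all(records, workers=4):
    """Split the job into chunks and run them concurrently,
    mirroring how a cluster spreads work across many cores."""
    size = max(1, len(records) // workers)
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(convert, chunks)  # chunks run in parallel
    return [rec for chunk in results for rec in chunk]
```

With independent chunks and enough workers, wall-clock time falls roughly in proportion to the worker count – the 2,000 hours of conversion collapsing to about one hour across 2,000 cores.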
Since deployment, the team has been experimenting with applications such as a traffic camera/Met Office data mashup for traffic forecasting, and a natural language chatbot that can talk about the weather. The organisation will be formally opening up the new DataPoint service to third-party developers and the public next year. “We’re really keen to see what people do with it,” says Tomlinson.