The Structure of a Modern Raster
The modern raster is a significant evolution from the grids and images of the past. It differs in its geometric complexity and flexibility, its flexible data storage, and its data compression options.
A modern raster can be of virtually unlimited size and will provide near-constant time access to its data at any scale by exploiting overview and underview resolution levels.
A modern raster will provide efficient storage by supporting sparse rasters, by offering lossless and lossy compression, and by offering a rich selection of storage data types.
A modern raster may also support the temporal dimension and cell editing.
Legacy rasters were a simple rectangular array of cells and could be described by providing an origin coordinate, the cell width and height, and the number of rows and columns. Sometimes, a rotation was also supported.
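To make the legacy description concrete, here is a minimal sketch of a grid defined only by an origin, a cell size, and a row and column count. The class name, the lower-left origin convention, and the absence of rotation are assumptions made for illustration; real formats differ on all three.

```python
import math
from dataclasses import dataclass


@dataclass
class LegacyGrid:
    """Hypothetical legacy grid: origin at the lower-left corner, no rotation."""
    origin_x: float
    origin_y: float
    cell_width: float
    cell_height: float
    rows: int
    cols: int

    def cell_to_world(self, row, col):
        """Return the world coordinates of the centre of cell (row, col)."""
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise IndexError("cell is outside the grid")
        x = self.origin_x + (col + 0.5) * self.cell_width
        y = self.origin_y + (row + 0.5) * self.cell_height
        return x, y

    def world_to_cell(self, x, y):
        """Return the (row, col) containing a world point, or None if outside."""
        col = math.floor((x - self.origin_x) / self.cell_width)
        row = math.floor((y - self.origin_y) / self.cell_height)
        if 0 <= row < self.rows and 0 <= col < self.cols:
            return row, col
        return None


grid = LegacyGrid(1000.0, 2000.0, 10.0, 10.0, rows=100, cols=200)
centre = grid.cell_to_world(0, 0)
```

Everything about the geometry is fixed by those six numbers, which is exactly why such a grid cannot grow or skip empty regions.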
A modern raster extends this simple concept to an extensible and editable sparse matrix of tiles. Each tile is a fixed-size rectangular array of cells. Tiles that do not contain data do not exist in the raster and consequently require no storage. Further flexibility can be introduced by allowing a variable cell size. Working from the assumption that the cell size should always be chosen to suit the density of the data observations, this allows the cell size to vary across the raster to suit the local variation in observation density. In the MapInfo system, we support this at tile scope, so a tile may use the base level cell size or some higher multiple of it (bounded by the requirement for a whole number of rows and columns in a tile). For more information on the size of a raster, refer to my article “How big is my raster?”
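The sparse tile idea can be sketched with a dictionary keyed by tile position. This is only an illustration of the concept, not the MapInfo implementation: the class name, the tiny tile size, and the use of None for empty cells are assumptions, and variable cell size is left out.

```python
class SparseTileRaster:
    """Hypothetical sparse raster: tiles are created only when data is written."""

    def __init__(self, tile_size=4, null_value=None):
        self.tile_size = tile_size
        self.null_value = null_value
        self.tiles = {}  # (tile_row, tile_col) -> 2D list of cell values

    def set_cell(self, row, col, value):
        ts = self.tile_size
        key = (row // ts, col // ts)  # floor division also handles negative indices
        tile = self.tiles.get(key)
        if tile is None:
            # Allocate the tile on first write; until then it costs nothing.
            tile = [[self.null_value] * ts for _ in range(ts)]
            self.tiles[key] = tile
        tile[row % ts][col % ts] = value

    def get_cell(self, row, col):
        ts = self.tile_size
        tile = self.tiles.get((row // ts, col // ts))
        if tile is None:
            return self.null_value  # the tile does not exist, so the cell is empty
        return tile[row % ts][col % ts]


raster = SparseTileRaster()
raster.set_cell(1_000_000, 2_000_000, 7.5)  # far from the origin
raster.set_cell(-1, -1, 3.0)                # the raster extends in any direction
```

After these two writes the raster holds exactly two tiles, regardless of how far apart the cells are.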
Legacy rasters contained a single data band, and this was soon extended to support multiple data bands – all the same data type and geometry. A modern raster has a deeper hierarchy that might encompass the following concepts – Folder, File, Field, Event, Band, Component.
Most legacy rasters were stored in multiple files, and some required the exclusive use of an entire folder on the storage device. Often a legacy raster would use an ASCII header file and one or more binary data files. Often there were also externally generated cache files for overviews and other metadata files that supported the target GIS system. A modern raster is more likely to be contained in a single file so that it is easy to share and transport.
Legacy raster formats were often created for a particular type of data but over time they were used to store other classes of raster data. Consequently, it is not always easy to identify what kind of data is in a legacy raster and how it ought to be presented to a GIS user. The TIFF format is the best example of this. The “Field” concept can be used in a modern raster to provide a high-level definition of what class of data is stored in a raster and to allow multiple different classes of data to be stored within a single raster.
The MapInfo raster system recognises four field types – Continuous, Imagery, Classified, and Image Palette. Continuous and Imagery fields store data “by value” – in other words, cells contain data values. Classified and Image Palette fields store data “by reference”. In this case only a single value is stored in each cell, and it is a zero-based row index into a classification table or color palette. The “bands” of the raster include the index band and then the columns in the table. These columns are sometimes referred to as “virtual bands”.
Continuous fields are the most general containers. They support multiple bands, and each band can be a specific data type for maximum storage efficiency.
Imagery fields contain a color band which will usually be RGB or RGBA.
Classified fields contain an integer index band. All other virtual bands are columns in the class table. The size of the class table is flexible and can be virtually unlimited.
Image Palette fields contain an index band. All other virtual bands are columns in the color palette. The color palette will contain a single column for the color which will usually be RGB or RGBA. Palettes are usually limited in size.
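The “by reference” storage used by Classified and Image Palette fields can be sketched as an index band plus a table, with each virtual band produced by looking up one table column. The table contents and the function name below are invented for illustration, and null cells are not handled.

```python
# Hypothetical class table: the row position is the zero-based class index.
class_table = [
    {"label": "Water", "color": (0, 0, 255)},
    {"label": "Forest", "color": (0, 128, 0)},
    {"label": "Urban", "color": (128, 128, 128)},
]

# The only band physically stored in the cells: one index per cell.
index_band = [
    [0, 0, 1],
    [1, 2, 2],
]


def virtual_band(index_band, table, column):
    """Derive a virtual band by looking up one table column for every cell."""
    return [[table[index][column] for index in row] for row in index_band]


labels = virtual_band(index_band, class_table, "label")
colors = virtual_band(index_band, class_table, "color")
```

Adding a column to the table adds a virtual band to the raster without touching a single cell.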
Components are virtual bands that are derived on-the-fly from a band that contains multi-component data. For example, the 32-bit color type RGBA can be split into four 8-bit color components – Red, Green, Blue, and Alpha. Other examples include complex numbers (real and imaginary components) and time/date data types.
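As a small illustration, the RGBA case can be sketched with bit shifts and masks. The byte order used here (red in the lowest byte, alpha in the highest) is an assumption; a real format may pack the components differently.

```python
def pack_rgba(red, green, blue, alpha):
    """Pack four 8-bit components into one 32-bit value (assumed byte order)."""
    return red | (green << 8) | (blue << 16) | (alpha << 24)


def split_rgba(packed):
    """Derive the four 8-bit components from a packed 32-bit RGBA value."""
    red = packed & 0xFF
    green = (packed >> 8) & 0xFF
    blue = (packed >> 16) & 0xFF
    alpha = (packed >> 24) & 0xFF
    return red, green, blue, alpha


cell = pack_rgba(10, 20, 30, 255)
components = split_rgba(cell)
```

Only the packed value is stored; the four components exist only when they are requested.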
Events provide support for temporal changes in a raster. This might be a ‘total replacement’ event, where all the data in the raster changes at a specific time. An example of this might be a rain radar. It might be a ‘partial replacement’ event, where only a small portion of the raster is changed. This can also be used for editing a raster without discarding the previous data state. The raster engine can use a “painter's algorithm” to combine data over multiple events to acquire data at any time or over any time range.
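A painter's algorithm over events can be sketched as follows: events are applied in time order, a total replacement wipes the canvas first, and a partial replacement paints over whatever is already there. The event tuple layout and the function name are assumptions made for this sketch, not the engine's actual data model.

```python
def state_at(events, query_time, rows, cols, null=None):
    """Compose the raster state at query_time from a list of events.

    Each event is (time, kind, cells), where kind is "total" or "partial"
    and cells maps (row, col) to a value. This layout is hypothetical.
    """
    grid = [[null] * cols for _ in range(rows)]
    for time, kind, cells in sorted(events, key=lambda event: event[0]):
        if time > query_time:
            break  # later events have not happened yet
        if kind == "total":
            # A total replacement discards everything painted so far.
            grid = [[null] * cols for _ in range(rows)]
        for (row, col), value in cells.items():
            grid[row][col] = value  # later paint covers earlier paint
    return grid


events = [
    (1, "total", {(0, 0): 1, (0, 1): 2}),
    (2, "partial", {(0, 1): 9}),
    (3, "total", {(1, 1): 5}),
]
after_edit = state_at(events, 2, rows=2, cols=2)
```

Because earlier events are never modified, querying at time 1 still returns the data as it was before the edit at time 2.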
A modern raster will be ‘multi-resolution’ so that it can provide an appropriate level of resolution for the scale at which the raster is being viewed or processed. This is critical for providing near-constant performance at any scale and also for providing the highest quality data at any scale. In the MapInfo raster engine, the base level is resolution level 0, overview levels are resolution levels 1 upwards, and underview levels are resolution levels -1 downwards.
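One way to picture the level numbering is a function that picks a level for a requested cell size. This sketch assumes a ‘power of 2’ pyramid and chooses the coarsest level whose cells are no larger than the requested size; the function name and the selection rule are illustrative assumptions, not the documented behaviour of any engine.

```python
import math


def resolution_level(base_cell_size, requested_cell_size, max_level):
    """Pick a resolution level in an assumed power-of-2 pyramid.

    Level 0 is the base level, positive levels are overviews and
    negative levels are underviews.
    """
    level = math.floor(math.log2(requested_cell_size / base_cell_size))
    # Overviews are finite; underviews can be generated as deep as needed.
    return min(level, max_level)


zoomed_out = resolution_level(1.0, 8.0, max_level=10)
zoomed_in = resolution_level(1.0, 0.25, max_level=10)
```

With a base cell size of 1.0, a request for 8.0 unit cells maps to overview level 3, and a request for 0.25 unit cells maps to underview level -2.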
The base level is the level at which you populate the cell values in a raster. All edits and events are made at the base level.
Overviews are created from the underlying resolution level by down-sampling. In an overview level, the cell size is larger than the base level, and it increases as you progress up the overview pyramid. Consequently, each overview level contains fewer cells than the level below it. In a ‘power of 2’ system, each overview cell might be created by averaging the values of the four cells that perfectly overlap it in the resolution level immediately below it. Overviews ought to be prepared in advance and stored in the raster or cached in a separate file. For example, QGIS/GDAL caches overviews in an external OVR file.
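The ‘power of 2’ averaging can be sketched in a few lines. This version ignores empty cells (represented here by None, which is an assumption) and averages whatever valid cells fall in each 2 x 2 block; the function name is hypothetical.

```python
def build_overview(cells):
    """Build the next overview level by averaging each 2 x 2 block of cells."""
    rows, cols = len(cells), len(cells[0])
    overview = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            # Collect the valid cells of the block, clipping at the raster edge.
            block = [
                cells[rr][cc]
                for rr in (r, r + 1)
                for cc in (c, c + 1)
                if rr < rows and cc < cols and cells[rr][cc] is not None
            ]
            out_row.append(sum(block) / len(block) if block else None)
        overview.append(out_row)
    return overview


base = [
    [1, 3, 5, 7],
    [1, 3, 5, 7],
    [2, 2, None, None],
    [2, 2, None, None],
]
level_1 = build_overview(base)
level_2 = build_overview(level_1)
```

Each call halves the row and column counts, so repeated calls build the pyramid one level at a time.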
Overview generation is complicated by the field type and by the band data type. Averaging or filtering by-value data (like imagery) is generally straightforward. But by-reference data (like classified fields) cannot be averaged, because the mean of two class indices is meaningless, so the most common index value needs to be used instead. Also, some continuous field bands will contain data that cannot be combined, and this may need to be treated like by-reference data. An example is the bit-wise error codes in a Landsat scene.
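For by-reference data, the down-sampling step can be sketched with a modal (most common value) filter instead of an average. The function name is hypothetical, and the tie-breaking rule (the first value encountered in the block wins) is an arbitrary choice made for this sketch.

```python
from collections import Counter


def build_classified_overview(index_band):
    """Down-sample an index band by taking the most common index per 2 x 2 block."""
    rows, cols = len(index_band), len(index_band[0])
    overview = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            block = [
                index_band[rr][cc]
                for rr in (r, r + 1)
                for cc in (c, c + 1)
                if rr < rows and cc < cols
            ]
            # most_common(1) returns the modal index; ties go to the first seen.
            out_row.append(Counter(block).most_common(1)[0][0])
        overview.append(out_row)
    return overview


classes = [
    [1, 1, 2, 3],
    [1, 2, 3, 3],
]
class_overview = build_classified_overview(classes)
```

Every value in the overview is a class index that actually occurs in the source block, which is the property an average cannot guarantee.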
Underviews are created from the base level by up-sampling. To interpolate the cell values in an underview level you can use methods like nearest neighbour, bi-linear, and bi-cubic. In the MapInfo raster engine, underviews are created on the fly and are not permanently stored.
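A bi-linear underview can be sketched as follows. Each base cell is split into 2 x 2 smaller cells and the value at each new cell centre is interpolated from the surrounding base cell centres, clamping at the raster edge. The function names are hypothetical and empty cells are not handled.

```python
import math


def sample_bilinear(cells, row, col):
    """Interpolate at a fractional (row, col) position measured between cell centres."""
    rows, cols = len(cells), len(cells[0])
    row = min(max(row, 0.0), rows - 1.0)  # clamp to the outermost cell centres
    col = min(max(col, 0.0), cols - 1.0)
    r0, c0 = math.floor(row), math.floor(col)
    r1, c1 = min(r0 + 1, rows - 1), min(c0 + 1, cols - 1)
    fr, fc = row - r0, col - c0
    top = cells[r0][c0] * (1 - fc) + cells[r0][c1] * fc
    bottom = cells[r1][c0] * (1 - fc) + cells[r1][c1] * fc
    return top * (1 - fr) + bottom * fr


def build_underview(cells):
    """Build the next underview level with twice as many rows and columns."""
    rows, cols = len(cells), len(cells[0])
    return [
        [
            # Map the centre of each new cell back into base level coordinates.
            sample_bilinear(cells, (r + 0.5) / 2 - 0.5, (c + 0.5) / 2 - 0.5)
            for c in range(cols * 2)
        ]
        for r in range(rows * 2)
    ]


underview = build_underview([[0.0, 4.0]])
```

Because nothing new is measured, the underview adds smoothness rather than information, which is one reason it makes sense to generate it on demand.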
A modern raster may require a flexible storage structure. The raster will contain many individual data structures – tiles, tables, overviews, statistics, maps, and metadata. It might take advantage of the file system and use multiple files. An alternative is to use a “file system within a file”. This allows data to be split into separate files, but they are internal to the single file on disk in which the entire raster is stored. MRR uses this system to provide simplicity for users and flexibility for raster storage, processing, and editing.
Finally, a modern raster will provide efficient storage by taking advantage of data compression. This has two advantages – firstly the total file size of the raster will be reduced and secondly the size of data packages transferred over the internet when accessing a raster in the cloud will be minimised. The disadvantage is that the data must be decompressed each time it is accessed from the file, and this adds a small processing overhead.
Most data will be losslessly compressed using well-known algorithms like LZ4 (fast), ZIP (common), and LZMA (best compressor). You can choose to compress imagery using either lossless or lossy algorithms. Lossy algorithms discard some information to reduce the file size and will degrade the quality of the imagery and introduce noise and imagery artifacts. Examples include JPEG (common) and JPEG2000 (wavelet compression).
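Lossless compression of a tile can be sketched with the Python standard library, which ships zlib (the DEFLATE algorithm used by ZIP) and lzma. LZ4 is left out because it needs a third-party package. The tile layout (64 x 64 little-endian 32-bit floats) and the function names are assumptions made for this sketch.

```python
import lzma
import struct
import zlib

# Hypothetical tile: 64 x 64 float32 cells with a smooth row-wise gradient.
TILE_CELLS = 64 * 64
values = [float(i // 64) for i in range(TILE_CELLS)]
raw_tile = struct.pack(f"<{TILE_CELLS}f", *values)


def compress_tile(raw, codec):
    """Losslessly compress the raw bytes of a tile with the named codec."""
    if codec == "zip":
        return zlib.compress(raw, 6)
    if codec == "lzma":
        return lzma.compress(raw)
    raise ValueError(f"unsupported codec: {codec}")


def decompress_tile(data, codec):
    """Recover the exact raw bytes of a tile."""
    if codec == "zip":
        return zlib.decompress(data)
    if codec == "lzma":
        return lzma.decompress(data)
    raise ValueError(f"unsupported codec: {codec}")


sizes = {codec: len(compress_tile(raw_tile, codec)) for codec in ("zip", "lzma")}
```

The round trip returns the original bytes exactly, which is the defining property of lossless compression; how much the size shrinks depends entirely on the data in the tile.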
Lossy wavelet compression has been used in ECW and MrSID formats for many years to reduce the size of large images. I would say that these compression techniques enabled the storage and rendering of very large images because they also provide overviews. However, my personal view is that the time for highly compressed imagery has passed, and lossless image compression is now preferable. If you can afford the increased storage cost and the increased internet traffic (and I think we now can), then lossless compression will provide the highest quality imagery. Also, much high-resolution imagery is now subject to processing by machine learning algorithms. It goes without saying that if you put noisy degraded imagery in, then you will get a poorer quality interpretation out when you apply this kind of processing.