Get Even More Visitors To Your Blog, Upgrade To A Business Listing >>

SQL Server On Linux: Forced Unit Access (Fua) Internals

Overview

SQL Server relies on Forced-Unit-Access (Fua) I/O subsystem capabilities to provide data durability, detailed in the following documents: SQL Server 2000 I/O Basic and SQL Server I/O Basics, Chapter 2

Durability: “In database systems, durability is the ACID property which guarantees transactions that have committed will survive permanently. For example, if a flight booking reports that a seat has successfully been booked, then the seat will remain booked even if the system crashes.” - https://en.wikipedia.org/wiki/Durability_(database_systems)

Durability is a cornerstone of any database system and starting with SQL Server 2017 on Linux Cumulative Update 6 (CU6), SQL Server on Linux enables “Forced Flush” behavior as described in this article, improving durability on non-Fua optimized systems.

“Be sure to deploy SQL Server 2017 CU6 or newer for best data durability and performance results.

Microsoft SQL Server Database Engine Input/Output Requirements

The Back Story

My involvement began when looking at the Write-Ahead-Logging (WAL) and the write behaviors on a Linux SQL Server. The investigation revealed that some of the Linux file systems may not be Fua optimized. Instead of passing the Fua bit with the write request (like Windows) Linux generates a separate device level flush request. Device level flushing may have an impact on your I/O caching, read ahead or other behaviors of the storage system. The “forced flush” changes in SQL Server avoid flushes, when possible, in order to improve performance on non-optimized Fua file systems.

Starting with the Linux 4.18 kernel additions to the Linux XFS file system can send the Fua bit along with the write request so “forced flush” behavior is not needed and the performance of writes is significantly improved. Testing shows that by using the Fua bit with the data, write request can reduce the I/O traffic by ~50% for a SQL Server, write-intensive workload. 


Performance points of reference:
The following charts highlight the increase in performance.


4 Socket, TPCC

9,108 – Old kernel

36,618 – Patched kernel

2 Socket, TPCC

9,606 – Old kernel

18,840 – Patched kernel

During my investigation, I traced, tested, reviewed, and instrumented the Linux kernel to understand the SQL Server I/O durability needs and associated impacts. This effort resulted in:

· Changes to the SQL Server Engine
· Changes to the SQL Server Linux Host Extension
· Validation with the Windows Storage team
· Validation with Windows Hyper-V team
· Validation with Windows Azure Storage team
· Conversations with Red Hat, HP, SUSE and Intel
· Changes in the Linux file system

Before all was said and done, all of these people and more helped with various activities:

· Eugene, Mike, Slava, Scott, Peter, Steve and others doing code reviews on my SQL Server and Host Extension changes.
· Val and Dylan performing power outage durability testing and adding new tests.
· Mitch and Ben doing functionality testing with the matrix of trace flags, mssql.conf options, Red Hat, Centos, Ubuntu, SUSE, and Windows using xfs, xfs opt, ext4, refs and ntfs.
· Pat validating Docker functionality and updating SQLIOSim to allow forced flush patterns.
· Patrick and Purvi doing performance and regression analytics.
· Kapil building and validating behavior on the Linux kernel changes for xfs (Note: I have tagged it xfs opt (xfs optimizations) in this document for reference.)
· Amit, Lee, Vaqar and Mike handling release activities and documentation.
· Neal, Matt, and others from Windows Storage, Windows Azure Storage, Windows Hyper-V, … validating Windows behaviors.
· Red Hat, SUSE, HPE, Intel helping and confirming Linux behaviors and making file system code changes.
· Pradeep, Venu and Suresh handling support issues.
 

Special thanks to Dave Chinner and the Red Hat team, Dan Williams, Hamesh Patel and their colleagues from Intel, Scott Norton and his colleagues, Josh Paulson and members of the Microsoft Linux teams, Jan Kara from SUSE and many others.

Terms

Before diving into details the terminology is important.

 

Term

Type

Description

Stable Media

Concept

Any storage device that can survive a power outage.  In the 90’s we talked about saving data on spinning media.  SSDs, persistent memory, battery-backed controllers broaden the range of devices surviving power outage from traditional, spinning media.

 

WAL

Concept

Write-Ahead Logging or journaling.  The log records, which can be used to recreate the data, are stored in stable media before the data buffers are written and data buffers are stored in stable media before the supporting log records can be deleted.

 

Fua

Hardware

Fua = Forced Unit Access.  Fua is the standard used to issue an I/O write command in stable data storage.  Fua has been a long-standing part of the SCSI specifications. 

Caution: Non-SCSI storage implementations often silently ignore the request.

 

Device Flush

Hardware

A device level flush forces any dirty buffer located in volatile storage onto stable media.  Flushing a non-battery backed disk or cache device can be performance impacting and repeated device flush commands reduce the optimized capabilities, such as, writes near current head placement, wear level reduction, caching mechanisms, etc.
 

FILE_FLAG_NO_BUFFERING

Windows

FILE_FLAG_NO_BUFFERING is the Win32, CreateFile API flags and attributes setting to bypass file system cache.  This flag does NOT imply flush behaviors, only the avoidance of the system file cache.  FILE_FLAG_NO_BUFFERING maps to the O_DIRECT flag for the Linux open command.

 

Warning: The flag causes the file system cache of a guest VM to be bypassed but the VM or container may not pass the option to the host.

 

O_DIRECT

Linux

Linux open command flag used to bypass file system cache.

 

Warning: The flag causes the file system cache of a guest VM to be bypassed but the VM or container may not pass the option to the host.

 

FILE_FLAG_WRITE_THROUGH

Windows

FILE_FLAG_WRITE_THROUGH is the Win32, CreateFile API flags and attributes setting generating Forced Unit Access (Fua) writes.  A write honoring Fua is issued with the Fua flag and does not return from the write command until transferred to stable media. 

 

FILE_FLAG_WRITE_THROUGH maps to the O_DSYNC flag option for the Linux open command.

 

The FILE_FLAG_WRITE_THROUGH flag does NOT imply FILE_FLAG_NO_BUFFERING behavior. To achieve no buffering and Fua, both the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags must be specified.

 

O_DSYNC | fdatasync | RWF_ODSYNC

Linux

The O_DSYNC | RWF_ODSYNC | fdatasync indicate to the Linux file systems that the ‘data integrity standard’ is to be upheld for I/O operations.  The data integrity standard makes sure the data and supporting meta data, needed to retrieve the data, are stored in stable media.

 

O_SYNC and fsync indicate to the Linux file system to uphold the ‘file integrity standard.’  O_SYNC is a superset of the O_DSYNC as the standard indicates that all file metadata (access timestamps for example) are stored in stable media.

 

RWF_ODSYNC is a newer variation of O_DSYNC.  The RWF_ODSYNC can be set as a flag for each I/O request.  Whereas, opening the file with O_DSYNC establishes flushing behavior for all writes to the file.  SQL Server does NOT make use of RWF_ODSYNC.

 

FILE_FLAG_OVERLAPPED

Windows

FILE_FLAG_OVERLAPPED is the Win32, CreateFile API flags and attributes setting enabling asynchronous API behavior.  Control from APIs such as WriteFile and ReadFile returns as soon as the I/O is queued.  The operating system signals completion when the I/O stack finishes the request.

 

AIO | io_submit | Kernel I/O (kio)

Linux

Linux provides different system calls (syscall)/commands for asynchronous behavior.
 

·         write command – synchronous syscall, does not return until the write completes.

·         aio_write | io_submit – asynchronous syscall, returning control as soon as the I/O request is queued. Sql Server uses aio* for asynchronous I/O operations.

          

Synchronized

Concept

Not to be confused with the synchronous behavior of the API/ABI invocations.  In the context of I/O, synchronized indicates that the storage and in-memory copies of the data are the same.  O_DSYNC often uses this term to indicate that the in-memory buffer is in sync with the storage media.

 

Data Integrity Standard

Concept

This data integrity standard is a subset of the file integrity standard.  In summary, the data integrity standard is the ability of the file system to store and retrieve data over a power outage.  This includes the data itself as well as the tracking structures for the storage.

 

This diagram shows a high-level view of a File.  The file holds the metadata, made up of attributes (time of last access, permissions, …) as well as mappings to actual storage (sectors on a disk, pages on an SSD, …)

 

Flush to stable media occurs for the data and storage mappings to achieve the data integrity standard.

 

Refer to the ISO documentation on data integrity for complete details. CSA ISO/IEC 9945-1:2005 (R2009) 3.375 Synchronized I/O Data Integrity Completion

 

File Integrity Standard

Concept

This file integrity standard is a superset of the data integrity standard.  The file integrity standard is the ability for the file system to store and retrieve data and metadata written over a power outage including attributes such as the time of last access and permissions.

 

Refer to the ISO documentation on file integrity for complete details: CSA ISO/IEC 9945-1:2005 (R2009) 3.376 Synchronized I/O File Integrity Completion

 

O_NONBLOCK

Linux

Linux open command flag value to avoid blocking during an I/O call.  This is not a FILE_FLAG_OVERLAPPED implementation for the Linux ABIs.  Instead, the flag is used to decide if a wait will occur while attempting to issue the I/O request.  If the thread would become blocked the attempt to issue the request is aborted and control is returned to the caller.

 

Overlapped will issue the I/O and return STATUS_PENDING, whereas the O_NONBLOCK returns EAGAIN and the caller must retry the entire I/O operation.

 

O_NOATIME

Linux

Linux open command flag to avoid updating the time of last access which can reduce metadata writes.  The O_NOATIME flag does NOT disable MTIME or CTIME modifications.

 

REQ_FUA

Linux

Block device write request using the Fua capabilities when performing a write.  The Fua bit is sent with the write signaling stable media storage is required.

 

REQ_PREFLUSH
REQ_POSTFLUSH

Linux

A block device write request (0 bytes – write[10, 12 or 16 bytes] SCSI command) requesting a device flush.

 

DpoFua

Linux

DPO = Disable Page Out (for writes instructs I/O subsystem to avoid adding write data to the cache, for reads prevents replacing existing entries in cache.)

 

Fua and Dpo are separate bit values in the SCSI SCB.  SQL Server only requires Fua and not DPO capabilities.

 

Forced Flush Behavior

SQL Server

When enabled the Linux SQL Server opens the file with FILE_FLAG_NO_BUFFERING and without FILE_FLAG_WRITE_THROUGH (no FUA.)  SQL Server log and checkpoint made durable with calls to FlushFileBuffers.

 

Common File Systems Used For SQL Server Deployments

ntfs Windows New Technology File System
refs Windows Resilient File System
xfs High performance, 64-bit journaling file system for Linux
ext4 Fourth extended file system with journaling for Linux

Supported platforms and file systems for SQL Server on Linux

Handling of Fua

The Linux Host Extension(HE) and SQLPAL work together to translate Win32 CreateFile, WriteFile, ReadFile and other I/O calls into Linux system calls. SQL Server opens data and log files using CreateFile with FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH. The flags are converted to the Linux open system call using the O_DIRECT and O_DSYNC flags, respectively.

When SQL Server performs a write, using a file opened with FILE_FLAG_WRITE_THROUGH(O_DSYNC), Linux performs the multi-step operation as a data write followed by the complete device flush request instead of adding the Fua bit to the data write request.

 

  1. Issue write to block device for the data
  2. Receive the data write complete event in the block device’s write completion routine
  3. If O_DSYNC requested, issue block device flush
  4. Wait for block device flush request completion
  5. Return control to write caller

Every write, requiring O_DSYNC behavior, also invokes a block device flush on Linux.  The repeated use of a write request followed by a flush request may be detrimental to performance and will increase traffic on the I/O bus.

Note: The 4.18 Linux Kernel updates allow the XFS file system and a device supporting Fua writes to avoid the additional block device flush request.

Prior to the Linux Kernel 4.18 updates, Linux could use Fua but only for the file system journaling writes and not data writes.  Linux queries the device for the DpoFua support capabilities and when the device reports Fua support (DpoFua = 1) the journaling is optimized to use Fua writes.  The data storage always uses the multi-step flush process. 

Windows does NOT query the device for support, instead any file opened with FILE_FLAG_WRITE_THROUGH sends the Fua option with the write request where the target device is expected to honor the Fua option.  In some instances, the target silently ignores the request. The Windows implementation places the responsibility of making sure the I/O subsystem is Fua compliant on the database and system administrator.

Fua (Forced Unit Access)

Microsoft SQL Server requires Fua compliant storage (stable media) ensuring the data is stored in stable media before a write request returns to SQL Server.  For example, if you run SQL Server on an IDE or SATA drive, you may need to disable the physical disk cache to accommodate WAL (durability), requirements. The SQL Server 2017 CU6, forced flush feature, provides durability assistance on Linux systems without Fua.

 

Note:  Disabling the write cache using the Windows, properties dialog may not suffice.  Some drives are not compliant with the dialog options and require special utilities to configure the drive cache.  Linux may need similar utilities to control various hardware cache installations.

 

You can explore your system using the df and lsblk commands.  When Linux mounts a block device, Linux queries the attributes and capabilities of the block device.  One such capability is the ability to support DpoFua for write requests.  When DpoFua=1 is reported, the device is reporting support for Fua.


Note: These commands may require root permissions to execute.

 

df -T

 

Filesystem          Type     1K-blocks     Used Available Use% Mounted on

devtmpfs            devtmpfs   4062840        0   4062840   0% /dev

tmpfs               tmpfs      4078624        0   4078624   0% /dev/shm

tmpfs               tmpfs      4078624     8880   4069744   1% /run

tmpfs               tmpfs      4078624        0   4078624   0% /sys/fs/cgroup

/dev/mapper/cl-root xfs       52403200 13010400  39392800  25% /

/dev/sda1           xfs        1038336   431168    607168  42% /boot

/dev/mapper/cl-home xfs      114361796 22488816  91872980  20% /home

Share the post

SQL Server On Linux: Forced Unit Access (Fua) Internals

×

Subscribe to Msdn Blogs | Get The Latest Information, Insights, Announcements, And News From Microsoft Experts And Developers In The Msdn Blogs.

Get updates delivered right to your inbox!

Thank you for your subscription

×