Overview
SQL Server relies on Forced-Unit-Access (Fua) I/O subsystem capabilities to provide data durability, detailed in the following documents: SQL Server 2000 I/O Basic and SQL Server I/O Basics, Chapter 2
Durability: “In database systems, durability is the ACID property which guarantees transactions that have committed will survive permanently. For example, if a flight booking reports that a seat has successfully been booked, then the seat will remain booked even if the system crashes.” - https://en.wikipedia.org/wiki/Durability_(database_systems)
Durability is a cornerstone of any database system and starting with SQL Server 2017 on Linux Cumulative Update 6 (CU6), SQL Server on Linux enables “Forced Flush” behavior as described in this article, improving durability on non-Fua optimized systems.
“Be sure to deploy SQL Server 2017 CU6 or newer for best data durability and performance results.“
Microsoft SQL Server Database Engine Input/Output Requirements
The Back Story
My involvement began when looking at the Write-Ahead-Logging (WAL) and the write behaviors on a Linux SQL Server. The investigation revealed that some of the Linux file systems may not be Fua optimized. Instead of passing the Fua bit with the write request (like Windows) Linux generates a separate device level flush request. Device level flushing may have an impact on your I/O caching, read ahead or other behaviors of the storage system. The “forced flush” changes in SQL Server avoid flushes, when possible, in order to improve performance on non-optimized Fua file systems.
Starting with the Linux 4.18 kernel additions to the Linux XFS file system can send the Fua bit along with the write request so “forced flush” behavior is not needed and the performance of writes is significantly improved. Testing shows that by using the Fua bit with the data, write request can reduce the I/O traffic by ~50% for a SQL Server, write-intensive workload. |
Performance points of reference: The following charts highlight the increase in performance.
4 Socket, TPCC 9,108 – Old kernel 36,618 – Patched kernel |
2 Socket, TPCC 9,606 – Old kernel 18,840 – Patched kernel |
During my investigation, I traced, tested, reviewed, and instrumented the Linux kernel to understand the SQL Server I/O durability needs and associated impacts. This effort resulted in:
· Changes to the SQL Server Engine
· Changes to the SQL Server Linux Host Extension
· Validation with the Windows Storage team
· Validation with Windows Hyper-V team
· Validation with Windows Azure Storage team
· Conversations with Red Hat, HP, SUSE and Intel
· Changes in the Linux file system
Before all was said and done, all of these people and more helped with various activities:
· Eugene, Mike, Slava, Scott, Peter, Steve and others doing code reviews on my SQL Server and Host Extension changes.
· Val and Dylan performing power outage durability testing and adding new tests.
· Mitch and Ben doing functionality testing with the matrix of trace flags, mssql.conf options, Red Hat, Centos, Ubuntu, SUSE, and Windows using xfs, xfs opt, ext4, refs and ntfs.
· Pat validating Docker functionality and updating SQLIOSim to allow forced flush patterns.
· Patrick and Purvi doing performance and regression analytics.
· Kapil building and validating behavior on the Linux kernel changes for xfs (Note: I have tagged it xfs opt (xfs optimizations) in this document for reference.)
· Amit, Lee, Vaqar and Mike handling release activities and documentation.
· Neal, Matt, and others from Windows Storage, Windows Azure Storage, Windows Hyper-V, … validating Windows behaviors.
· Red Hat, SUSE, HPE, Intel helping and confirming Linux behaviors and making file system code changes.
· Pradeep, Venu and Suresh handling support issues.
Special thanks to Dave Chinner and the Red Hat team, Dan Williams, Hamesh Patel and their colleagues from Intel, Scott Norton and his colleagues, Josh Paulson and members of the Microsoft Linux teams, Jan Kara from SUSE and many others.
Terms
Before diving into details the terminology is important.
Term |
Type |
Description |
||
Stable Media |
Concept |
Any storage device that can survive a power outage. In the 90’s we talked about saving data on spinning media. SSDs, persistent memory, battery-backed controllers broaden the range of devices surviving power outage from traditional, spinning media.
|
||
WAL |
Concept |
Write-Ahead Logging or journaling. The log records, which can be used to recreate the data, are stored in stable media before the data buffers are written and data buffers are stored in stable media before the supporting log records can be deleted.
|
||
Fua |
Hardware |
Fua = Forced Unit Access. Fua is the standard used to issue an I/O write command in stable data storage. Fua has been a long-standing part of the SCSI specifications.
|
||
Device Flush |
Hardware |
A device level flush forces any dirty buffer located in volatile storage onto stable media. Flushing a non-battery backed disk or cache device can be performance impacting and repeated device flush commands reduce the optimized capabilities, such as, writes near current head placement, wear level reduction, caching mechanisms, etc. |
||
FILE_FLAG_NO_BUFFERING |
Windows |
FILE_FLAG_NO_BUFFERING is the Win32, CreateFile API flags and attributes setting to bypass file system cache. This flag does NOT imply flush behaviors, only the avoidance of the system file cache. FILE_FLAG_NO_BUFFERING maps to the O_DIRECT flag for the Linux open command.
Warning: The flag causes the file system cache of a guest VM to be bypassed but the VM or container may not pass the option to the host.
|
||
O_DIRECT |
Linux |
Linux open command flag used to bypass file system cache.
Warning: The flag causes the file system cache of a guest VM to be bypassed but the VM or container may not pass the option to the host.
|
||
FILE_FLAG_WRITE_THROUGH |
Windows |
FILE_FLAG_WRITE_THROUGH is the Win32, CreateFile API flags and attributes setting generating Forced Unit Access (Fua) writes. A write honoring Fua is issued with the Fua flag and does not return from the write command until transferred to stable media.
FILE_FLAG_WRITE_THROUGH maps to the O_DSYNC flag option for the Linux open command.
The FILE_FLAG_WRITE_THROUGH flag does NOT imply FILE_FLAG_NO_BUFFERING behavior. To achieve no buffering and Fua, both the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags must be specified.
|
||
O_DSYNC | fdatasync | RWF_ODSYNC |
Linux |
The O_DSYNC | RWF_ODSYNC | fdatasync indicate to the Linux file systems that the ‘data integrity standard’ is to be upheld for I/O operations. The data integrity standard makes sure the data and supporting meta data, needed to retrieve the data, are stored in stable media.
O_SYNC and fsync indicate to the Linux file system to uphold the ‘file integrity standard.’ O_SYNC is a superset of the O_DSYNC as the standard indicates that all file metadata (access timestamps for example) are stored in stable media.
RWF_ODSYNC is a newer variation of O_DSYNC. The RWF_ODSYNC can be set as a flag for each I/O request. Whereas, opening the file with O_DSYNC establishes flushing behavior for all writes to the file. SQL Server does NOT make use of RWF_ODSYNC.
|
||
FILE_FLAG_OVERLAPPED |
Windows |
FILE_FLAG_OVERLAPPED is the Win32, CreateFile API flags and attributes setting enabling asynchronous API behavior. Control from APIs such as WriteFile and ReadFile returns as soon as the I/O is queued. The operating system signals completion when the I/O stack finishes the request.
|
||
AIO | io_submit | Kernel I/O (kio) |
Linux |
Linux provides different system calls (syscall)/commands for asynchronous behavior. · write command – synchronous syscall, does not return until the write completes. · aio_write | io_submit – asynchronous syscall, returning control as soon as the I/O request is queued. Sql Server uses aio* for asynchronous I/O operations.
|
||
Synchronized |
Concept |
Not to be confused with the synchronous behavior of the API/ABI invocations. In the context of I/O, synchronized indicates that the storage and in-memory copies of the data are the same. O_DSYNC often uses this term to indicate that the in-memory buffer is in sync with the storage media.
|
||
Data Integrity Standard |
Concept |
This data integrity standard is a subset of the file integrity standard. In summary, the data integrity standard is the ability of the file system to store and retrieve data over a power outage. This includes the data itself as well as the tracking structures for the storage.
Refer to the ISO documentation on data integrity for complete details. CSA ISO/IEC 9945-1:2005 (R2009) 3.375 Synchronized I/O Data Integrity Completion
|
||
File Integrity Standard |
Concept |
This file integrity standard is a superset of the data integrity standard. The file integrity standard is the ability for the file system to store and retrieve data and metadata written over a power outage including attributes such as the time of last access and permissions.
Refer to the ISO documentation on file integrity for complete details: CSA ISO/IEC 9945-1:2005 (R2009) 3.376 Synchronized I/O File Integrity Completion
|
||
O_NONBLOCK |
Linux |
Linux open command flag value to avoid blocking during an I/O call. This is not a FILE_FLAG_OVERLAPPED implementation for the Linux ABIs. Instead, the flag is used to decide if a wait will occur while attempting to issue the I/O request. If the thread would become blocked the attempt to issue the request is aborted and control is returned to the caller.
Overlapped will issue the I/O and return STATUS_PENDING, whereas the O_NONBLOCK returns EAGAIN and the caller must retry the entire I/O operation.
|
||
O_NOATIME |
Linux |
Linux open command flag to avoid updating the time of last access which can reduce metadata writes. The O_NOATIME flag does NOT disable MTIME or CTIME modifications.
|
||
REQ_FUA |
Linux |
Block device write request using the Fua capabilities when performing a write. The Fua bit is sent with the write signaling stable media storage is required.
|
||
REQ_PREFLUSH |
Linux |
A block device write request (0 bytes – write[10, 12 or 16 bytes] SCSI command) requesting a device flush.
|
||
DpoFua |
Linux |
DPO = Disable Page Out (for writes instructs I/O subsystem to avoid adding write data to the cache, for reads prevents replacing existing entries in cache.)
Fua and Dpo are separate bit values in the SCSI SCB. SQL Server only requires Fua and not DPO capabilities.
|
||
Forced Flush Behavior |
SQL Server |
When enabled the Linux SQL Server opens the file with FILE_FLAG_NO_BUFFERING and without FILE_FLAG_WRITE_THROUGH (no FUA.) SQL Server log and checkpoint made durable with calls to FlushFileBuffers.
|
Common File Systems Used For SQL Server Deployments
ntfs | Windows New Technology File System |
refs | Windows Resilient File System |
xfs | High performance, 64-bit journaling file system for Linux |
ext4 | Fourth extended file system with journaling for Linux |
Supported platforms and file systems for SQL Server on Linux
Handling of Fua
The Linux Host Extension(HE) and SQLPAL work together to translate Win32 CreateFile, WriteFile, ReadFile and other I/O calls into Linux system calls. SQL Server opens data and log files using CreateFile with FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH. The flags are converted to the Linux open system call using the O_DIRECT and O_DSYNC flags, respectively.
When SQL Server performs a write, using a file opened with FILE_FLAG_WRITE_THROUGH(O_DSYNC), Linux performs the multi-step operation as a data write followed by the complete device flush request instead of adding the Fua bit to the data write request.
- Issue write to block device for the data
- Receive the data write complete event in the block device’s write completion routine
- If O_DSYNC requested, issue block device flush
- Wait for block device flush request completion
- Return control to write caller
Every write, requiring O_DSYNC behavior, also invokes a block device flush on Linux. The repeated use of a write request followed by a flush request may be detrimental to performance and will increase traffic on the I/O bus.
Note: The 4.18 Linux Kernel updates allow the XFS file system and a device supporting Fua writes to avoid the additional block device flush request.
Prior to the Linux Kernel 4.18 updates, Linux could use Fua but only for the file system journaling writes and not data writes. Linux queries the device for the DpoFua support capabilities and when the device reports Fua support (DpoFua = 1) the journaling is optimized to use Fua writes. The data storage always uses the multi-step flush process.
Windows does NOT query the device for support, instead any file opened with FILE_FLAG_WRITE_THROUGH sends the Fua option with the write request where the target device is expected to honor the Fua option. In some instances, the target silently ignores the request. The Windows implementation places the responsibility of making sure the I/O subsystem is Fua compliant on the database and system administrator.
Fua (Forced Unit Access)
Microsoft SQL Server requires Fua compliant storage (stable media) ensuring the data is stored in stable media before a write request returns to SQL Server. For example, if you run SQL Server on an IDE or SATA drive, you may need to disable the physical disk cache to accommodate WAL (durability), requirements. The SQL Server 2017 CU6, forced flush feature, provides durability assistance on Linux systems without Fua.
Note: Disabling the write cache using the Windows, properties dialog may not suffice. Some drives are not compliant with the dialog options and require special utilities to configure the drive cache. Linux may need similar utilities to control various hardware cache installations.
You can explore your system using the df and lsblk commands. When Linux mounts a block device, Linux queries the attributes and capabilities of the block device. One such capability is the ability to support DpoFua for write requests. When DpoFua=1 is reported, the device is reporting support for Fua.
Note: These commands may require root permissions to execute.
df -T
Filesystem Type 1K-blocks Used Available Use% Mounted on
devtmpfs devtmpfs 4062840 0 4062840 0% /dev
tmpfs tmpfs 4078624 0 4078624 0% /dev/shm
tmpfs tmpfs 4078624 8880 4069744 1% /run
tmpfs tmpfs 4078624 0 4078624 0% /sys/fs/cgroup
/dev/mapper/cl-root xfs 52403200 13010400 39392800 25% /
/dev/sda1 xfs 1038336 431168 607168 42% /boot
/dev/mapper/cl-home xfs 114361796 22488816 91872980 20% /home
This post first appeared on MSDN Blogs | Get The Latest Information, Insights, Announcements, And News From Microsoft Experts And Developers In The MSDN Blogs., please read the originial post: here