Currently there are two mechanisms for handling guest MMIO/PIO accesses in KVM: returning KVM_EXIT_MMIO/KVM_EXIT_IO from ioctl(KVM_RUN), and ioeventfd. In the first case, KVM exits back to QEMU, which then forwards the access to the emulated device. The traditional dispatch mechanism looks like this:
kvm.ko <---ioctl(KVM_RUN)---> VMM vCPU task <---messages---> device task
In the second case, the ioeventfd mechanism can be used for posted doorbell writes. A guest write to the registered address signals the provided event instead of triggering an exit. This lets the host be notified in a lightweight way (a so-called "lightweight vmexit"), which suits triggers that want to send a notification asynchronously and return as quickly as possible. ioeventfd can also be dispatched through QEMU (using KVM_EXIT_MMIO/KVM_EXIT_IO from ioctl(KVM_RUN)) when kvm_eventfds_allowed is false, but this lowers performance: benchmarking shows that KVM ioeventfd is more than 30% faster.
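As an illustration of the doorbell pattern described above, here is a minimal sketch in C. It uses the existing KVM_IOEVENTFD ioctl and struct kvm_ioeventfd from <linux/kvm.h>; the helper names (make_doorbell, register_doorbell, wait_doorbell) are hypothetical, and vm_fd is assumed to be a VM file descriptor obtained elsewhere via KVM_CREATE_VM.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Describe a doorbell: a guest write of `len` bytes at `addr` signals `efd`. */
struct kvm_ioeventfd make_doorbell(uint64_t addr, uint32_t len, int efd)
{
    struct kvm_ioeventfd ioev;

    /* This sketch only handles the ordinary access widths. */
    assert(len == 1 || len == 2 || len == 4 || len == 8);

    memset(&ioev, 0, sizeof(ioev));
    ioev.addr = addr;
    ioev.len = len;
    ioev.fd = efd;
    ioev.flags = 0; /* MMIO, any value; KVM_IOEVENTFD_FLAG_PIO selects port I/O */
    return ioev;
}

/* Register the doorbell on an existing VM fd. Returns 0 on success, -1 on error. */
int register_doorbell(int vm_fd, uint64_t addr, uint32_t len, int efd)
{
    struct kvm_ioeventfd ioev = make_doorbell(addr, len, efd);

    return ioctl(vm_fd, KVM_IOEVENTFD, &ioev);
}

/* Device side: block until the doorbell fires; returns the number of kicks seen. */
uint64_t wait_doorbell(int efd)
{
    uint64_t kicks = 0;
    ssize_t n = read(efd, &kicks, sizeof(kicks));

    return n == (ssize_t)sizeof(kicks) ? kicks : 0;
}
```

Because an eventfd only carries a counter, the device task learns that the doorbell was written but not which value was written, which is why this mechanism fits posted writes and not reads.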
The ioregionfd mechanism is proposed for faster in-kernel device dispatching. The control plane is the KVM VM ioctl KVM_SET_IOREGION, which registers MMIO/PIO regions. ioctl(KVM_SET_IOREGION) must be given read and write file descriptors, which the wire protocol uses for communication. Regions registered with ioregionfd must not overlap each other or ioeventfd regions, so that only one mechanism handles a given MMIO/PIO access. Regions can be deleted by setting fd to -1.
struct kvm_ioregion {
    __u64 guest_paddr; /* guest physical address */
    __u64 memory_size; /* bytes */
    __u64 user_data;
    __s32 rfd;
    __s32 wfd;
    __u32 flags;
    __u8  pad[28];
};
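The registration rules above can be sketched as follows. The struct is restated with stdint types so the example compiles without the proposed kernel header. The helpers make_ioregion, make_ioregion_delete and ioregions_overlap are hypothetical names, and setting both rfd and wfd to -1 for deletion is an assumption about which fd the text means. The KVM_SET_IOREGION call itself is shown only in a comment, since the ioctl number comes from the proposed patches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the struct above, using fixed-width stdint types. */
struct kvm_ioregion {
    uint64_t guest_paddr; /* guest physical address */
    uint64_t memory_size; /* bytes */
    uint64_t user_data;
    int32_t  rfd;
    int32_t  wfd;
    uint32_t flags;
    uint8_t  pad[28];
};

/* 8 + 8 + 8 + 4 + 4 + 4 + 28 bytes */
_Static_assert(sizeof(struct kvm_ioregion) == 64, "unexpected kvm_ioregion size");

/* Build a zero-initialised registration request for [gpa, gpa + size). */
struct kvm_ioregion make_ioregion(uint64_t gpa, uint64_t size,
                                  uint64_t user_data, int rfd, int wfd)
{
    struct kvm_ioregion r;

    assert(size > 0); /* an empty region is meaningless in this sketch */

    memset(&r, 0, sizeof(r));
    r.guest_paddr = gpa;
    r.memory_size = size;
    r.user_data = user_data; /* opaque token echoed back in each command */
    r.rfd = rfd;
    r.wfd = wfd;
    return r;
}

/* Assumption: a deletion request carries -1 in both descriptors. */
struct kvm_ioregion make_ioregion_delete(uint64_t gpa, uint64_t size)
{
    return make_ioregion(gpa, size, 0, -1, -1);
}

/* Two half-open ranges overlap when each one starts before the other ends. */
int ioregions_overlap(const struct kvm_ioregion *a, const struct kvm_ioregion *b)
{
    return a->guest_paddr < b->guest_paddr + b->memory_size &&
           b->guest_paddr < a->guest_paddr + a->memory_size;
}

/*
 * With the proposed kernel support, registration would then look like:
 *     ioctl(vm_fd, KVM_SET_IOREGION, &region);
 */
```

A VMM could run ioregions_overlap against its own list of registered regions before issuing the ioctl, so that a conflicting request is reported early and does not depend on the kernel rejecting it.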
The data plane is a bi-directional message protocol (wire protocol) that ioregionfd uses to communicate with the emulated device. The device reads commands with the following layout from the file descriptor:
struct ioregionfd_cmd {
    __u32 info;
    __u32 padding;
    __u64 user_data;
    __u64 offset;
    __u64 data;
};
The info field layout is as follows:
bits:  | 31 ... 8 |  6   | 5 ... 4 | 3 ... 0 |
field: | reserved | resp |  size   |   cmd   |
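The layout above can be unpacked with a few shifts and masks. In this sketch the bit positions come from the table, and IOREGIONFD_CMD_MASK is given the value implied by bits 3..0. The other macro and function names are hypothetical, and treating the size field as log2 of the access width in bytes is an assumption that this text does not confirm.

```c
#include <assert.h>
#include <stdint.h>

#define IOREGIONFD_CMD_MASK   0xfu /* bits 3..0 */
#define IOREGIONFD_SIZE_SHIFT 4    /* bits 5..4 */
#define IOREGIONFD_SIZE_MASK  0x3u
#define IOREGIONFD_RESP_SHIFT 6    /* bit 6 */

/* Command code (read, write, ...). */
uint32_t ioregionfd_cmd_code(uint32_t info)
{
    return info & IOREGIONFD_CMD_MASK;
}

/* Assumption: the size field encodes log2(bytes), so 0..3 maps to 1, 2, 4, 8. */
unsigned ioregionfd_access_bytes(uint32_t info)
{
    return 1u << ((info >> IOREGIONFD_SIZE_SHIFT) & IOREGIONFD_SIZE_MASK);
}

/* Non-zero when the resp bit is set. */
int ioregionfd_resp_bit(uint32_t info)
{
    return (info >> IOREGIONFD_RESP_SHIFT) & 1u;
}

/* Inverse operation, handy for tests and for emulating the kernel side. */
uint32_t ioregionfd_make_info(uint32_t cmd, unsigned size_log2, int resp)
{
    assert(cmd <= IOREGIONFD_CMD_MASK);
    assert(size_log2 <= IOREGIONFD_SIZE_MASK);

    return cmd |
           ((uint32_t)size_log2 << IOREGIONFD_SIZE_SHIFT) |
           ((resp ? 1u : 0u) << IOREGIONFD_RESP_SHIFT);
}
```

For example, under these assumptions ioregionfd_make_info(1, 2, 1) yields 0x61: command 1, a 4-byte access, and the resp bit set.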
Thus a device emulation task can use a run loop with the following code:
switch (cmd.info & IOREGIONFD_CMD_MASK) {
case IOREGIONFD_CMD_READ:
    /* It's a read access */
    break;
case IOREGIONFD_CMD_WRITE:
    /* It's a write access */
    break;
default:
    /* Protocol violation, terminate connection */
    break;
}
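To make the run loop more concrete, here is a sketch of one iteration for a toy device with four 64-bit registers. Several parts are assumptions and not taken from this text: the numeric values of IOREGIONFD_CMD_READ and IOREGIONFD_CMD_WRITE, the layout of struct ioregionfd_resp, and the rule that a write is answered only when the resp bit is set. The toy_device type and the function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

struct ioregionfd_cmd {
    uint32_t info;
    uint32_t padding;
    uint64_t user_data;
    uint64_t offset;
    uint64_t data;
};

/* Assumed response layout; not specified in the text above. */
struct ioregionfd_resp {
    uint64_t data;
    uint8_t  pad[24];
};

#define IOREGIONFD_CMD_MASK  0xfu
#define IOREGIONFD_CMD_READ  0u          /* assumed value */
#define IOREGIONFD_CMD_WRITE 1u          /* assumed value */
#define IOREGIONFD_RESP_BIT  (1u << 6)

#define TOY_NREGS 4
struct toy_device { uint64_t regs[TOY_NREGS]; };

/* Read exactly len bytes; returns 0 on success, -1 on EOF or error. */
int read_full(int fd, void *buf, size_t len)
{
    uint8_t *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Handle one command. Returns 0 to keep going, -1 to terminate the connection. */
int device_handle_one(struct toy_device *dev, int rfd, int wfd)
{
    struct ioregionfd_cmd cmd;
    struct ioregionfd_resp resp;
    uint64_t idx;

    assert(dev != NULL);
    if (read_full(rfd, &cmd, sizeof(cmd)) < 0)
        return -1;

    idx = cmd.offset / sizeof(uint64_t);
    memset(&resp, 0, sizeof(resp));

    switch (cmd.info & IOREGIONFD_CMD_MASK) {
    case IOREGIONFD_CMD_READ:
        /* Out-of-range reads return all ones, as unmapped MMIO often does. */
        resp.data = idx < TOY_NREGS ? dev->regs[idx] : ~0ULL;
        break;
    case IOREGIONFD_CMD_WRITE:
        if (idx < TOY_NREGS)
            dev->regs[idx] = cmd.data;
        if (!(cmd.info & IOREGIONFD_RESP_BIT))
            return 0; /* assumed posted write: no response expected */
        break;
    default:
        return -1; /* protocol violation */
    }

    return write(wfd, &resp, sizeof(resp)) == (ssize_t)sizeof(resp) ? 0 : -1;
}
```

A device task would call device_handle_one in a loop until it returns -1. The descriptors passed as rfd and wfd are the device's ends of the channels whose other ends were registered with ioctl(KVM_SET_IOREGION).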
ioregionfd improves performance by eliminating the need for the vCPU task to forward MMIO/PIO exits to device emulation tasks:
kvm.ko <---------ioctl(KVM_RUN)-------> VMM vCPU task
  ^
  `---ioregionfd-----------------------------> device task
ioregionfd API design discussions can be found here and here.