Linux 内核揭密

The previous part was the first part of the current chapter, which describes timers and time management in the Linux kernel. We got acquainted with two concepts in the previous part:

jiffies
clocksource

The first is a global variable that is defined in the include/linux/jiffies.h header file and represents a counter that is increased on each timer interrupt. So if we can access this global variable and we know the timer interrupt rate, we can convert jiffies to human time units. As we already know, the timer interrupt rate is represented by a compile-time constant called HZ in the Linux kernel. The value of HZ is equal to the value of the CONFIG_HZ kernel configuration option, and if we look into the arch/x86/configs/x86_64_defconfig kernel configuration file, we will see that the following kernel configuration option is set:

CONFIG_HZ_1000=y

This means that the value of CONFIG_HZ will be 1000 by default for the x86_64 architecture. So, if we divide the value of jiffies by the value of HZ:

jiffies / HZ

we will get the number of seconds that have elapsed since the Linux kernel started to work, or in other words the system uptime. Since HZ represents the number of timer interrupts in a second, we can also compute a value for some time in the future. For example:

/* one minute from now */
unsigned long later = jiffies + 60*HZ;

/* five minutes from now */
unsigned long later = jiffies + 5*60*HZ;

This is a very common practice in the Linux kernel. For example, if you look into the arch/x86/kernel/smpboot.c source code file, you will find the do_boot_cpu function. This function boots all processors besides the bootstrap processor. In it you can find a snippet that waits ten seconds for a response from the application processor:

if (!boot_error) {
    timeout = jiffies + 10*HZ;
    while (time_before(jiffies, timeout)) {
        ...
        ...
        ...
        udelay(100);
    }
    ...
    ...
    ...
}

We assign the jiffies + 10*HZ value to the timeout variable here. As you have probably understood, this means a ten second timeout. After this we enter a loop where we use the time_before macro to compare the current jiffies value with our timeout.

Or, for example, if we look into the sound/isa/sscape.c source code file, which contains the driver for the Ensoniq Soundscape Elite sound card, we will see the obp_startup_ack function that waits up to a given timeout for the On-Board Processor to return its start-up acknowledgement sequence:

static int obp_startup_ack(struct soundscape *s, unsigned timeout)
{
    unsigned long end_time = jiffies + msecs_to_jiffies(timeout);

    do {
        ...
        ...
        ...
        x = host_read_unsafe(s->io_base);
        ...
        ...
        ...
        if (x == 0xfe || x == 0xff)
            return 1;
        msleep(10);
    } while (time_before(jiffies, end_time));

    return 0;
}

As you can see, the jiffies variable is very widely used in the Linux kernel code. As I already wrote, we met another new time management related concept in the previous part: clocksource. We have only seen a short description of this concept and the API for clock source registration. Let's take a closer look in this part.

Introduction to `clocksource`

The clocksource concept represents the generic API for clock source management in the Linux kernel. Why do we need a separate framework for this? Let's go back to the beginning. Time is a fundamental concept in the Linux kernel and in other operating system kernels, and timekeeping is one of the necessities for using it. For example, the Linux kernel must know and update the time elapsed since system startup, it must determine how long the current process has been running on every processor, and much more. Where can the Linux kernel get information about time? First of all, there is the Real Time Clock, or RTC, which is a nonvolatile device. You can find a set of architecture-independent real time clock drivers in the drivers/rtc directory of the Linux kernel. Besides this, each architecture can provide a driver for an architecture-dependent real time clock, for example CMOS/RTC in arch/x86/kernel/rtc.c for the x86 architecture. The second source is the system timer, a timer that raises interrupts at a periodic rate. For example, on IBM PC compatibles this was the programmable interval timer.

We already know that we can use jiffies for timekeeping purposes in the Linux kernel. jiffies can be considered a read-only global variable that is updated with HZ frequency. We know that HZ is a compile-time kernel parameter whose reasonable range is from 100 to 1000 Hz. So we are guaranteed an interface for time measurement with a resolution of 1 to 10 milliseconds. Besides standard jiffies, we saw the refined_jiffies clock source in the previous part, which is based on the i8253/i8254 programmable interval timer tick rate of 1193182 hertz. One tick of that timer lasts a little less than 1 microsecond, so hardware counters can offer a much finer resolution than plain jiffies. Nowadays, nanoseconds are the favorite choice for the time value units of a given clock source.

The availability of more precise techniques for measuring time intervals is hardware-dependent. We have just learned a little about x86-dependent timer hardware, but each architecture provides its own timer hardware. Earlier, each architecture had its own implementation for this purpose. The solution to this problem is an abstraction layer and an associated API in a common code framework that manages various clock sources independently of the timer interrupt. This common code framework became the clocksource framework.

The generic timeofday and clock source management framework moved a lot of timekeeping code into the architecture-independent portion of the code, with the architecture-dependent portion reduced to defining and managing the low-level hardware pieces of clock sources. Measuring time intervals on different architectures with different hardware takes a lot of effort and is very complex. The implementation of each clock related service is strongly associated with an individual hardware device and, as you can understand, without a common framework this results in similar implementations being repeated for different architectures.

Within this framework, each clock source is required to maintain a representation of time as a monotonically increasing value. As we can see in the Linux kernel code, nanoseconds are currently the favorite choice for the time value units of a clock source. One of the main points of the clock source framework is to allow a user to select a clock source among a range of available hardware devices supporting clock functions when configuring the system, and to provide a way of selecting, accessing and scaling different clock sources.

The clocksource structure

The foundation of the clocksource framework is the clocksource structure, which is defined in the include/linux/clocksource.h header file. We already saw some of the fields provided by the clocksource structure in the previous part. Let's look at the full definition of this structure and try to describe all of its fields:

struct clocksource {
    cycle_t (*read)(struct clocksource *cs);
    cycle_t mask;
    u32 mult;
    u32 shift;
    u64 max_idle_ns;
    u32 maxadj;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
    struct arch_clocksource_data archdata;
#endif
    u64 max_cycles;
    const char *name;
    struct list_head list;
    int rating;
    int (*enable)(struct clocksource *cs);
    void (*disable)(struct clocksource *cs);
    unsigned long flags;
    void (*suspend)(struct clocksource *cs);
    void (*resume)(struct clocksource *cs);
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
    struct list_head wd_list;
    cycle_t cs_last;
    cycle_t wd_last;
#endif
    struct module *owner;
} ____cacheline_aligned;

We already saw the first field of the clocksource structure in the previous part. It is a pointer to the read function, which returns the current cycle value of the counter behind the clock source. For example, we use the jiffies_read function to read the jiffies value:

static struct clocksource clocksource_jiffies = {
    ...
    .read       = jiffies_read,
    ...
};

where jiffies_read just returns:

static cycle_t jiffies_read(struct clocksource *cs)
{
    return (cycle_t) jiffies;
}

Or the read_tsc function:

static struct clocksource clocksource_tsc = {
    ...
    .read                   = read_tsc,
    ...
};

for the time stamp counter reading.

The next field is mask, a bitmask of the bits that are valid in the counter value. It ensures that subtraction between counter values from counters narrower than 64 bits does not need special overflow logic. After the mask field, we can see two fields: mult and shift. These are the basis of the arithmetic that converts time values specific to each clock source. In other words, these two fields help us convert the abstract machine time units of a counter to nanoseconds.

After these two fields we can see the 64-bit max_idle_ns field, which represents the maximum idle time permitted by the clocksource, in nanoseconds. We need this field for a Linux kernel built with the CONFIG_NO_HZ kernel configuration option. This option enables the Linux kernel to run without a regular timer tick (we will see a full explanation of this in another part). The problem is that dynamic tick allows the kernel to sleep for periods longer than a single tick, and the sleep time could otherwise be unlimited. The max_idle_ns field represents this sleeping limit.

The next field after max_idle_ns is the maxadj field, which is the maximum adjustment value to mult. The main formula by which we convert cycles to nanoseconds:

((u64) cycles * mult) >> shift;

is not 100% accurate. Instead, the result is taken as close as possible to a nanosecond, and maxadj helps to correct this and allows the clocksource API to avoid mult values that might overflow when adjusted. The next four fields are pointers to functions:

enable - optional function to enable clocksource;
disable - optional function to disable clocksource;
suspend - suspend function for the clocksource;
resume - resume function for the clocksource;

The next field is max_cycles and, as we can understand from its name, it represents the maximum cycle value before potential overflow. The last field is owner, which is a reference to the kernel module that owns the clocksource. This is all. We just went through all the standard fields of the clocksource structure. But you may have noted that we missed some fields. We can divide the missed fields into two types. Fields of the first type are already known to us: for example, the name field that holds the name of a clocksource, the rating field that helps the Linux kernel to select the best clocksource, and so on. Fields of the second type depend on different Linux kernel configuration options. Let's look at these fields.

The first field is archdata. This field has the arch_clocksource_data type and depends on the CONFIG_ARCH_CLOCKSOURCE_DATA kernel configuration option. At the moment this field is relevant only for the x86 and IA64 architectures. And again, as we can understand from the field's name, it represents architecture-specific data for a clock source. For example, it represents the vDSO clock mode:

struct arch_clocksource_data {
    int vclock_mode;
};

for the x86 architectures, where the vDSO clock mode can be one of:

#define VCLOCK_NONE 0
#define VCLOCK_TSC  1
#define VCLOCK_HPET 2
#define VCLOCK_PVCLOCK 3

The last three fields, wd_list, cs_last and wd_last, depend on the CONFIG_CLOCKSOURCE_WATCHDOG kernel configuration option. First of all, let's try to understand what a watchdog is. In simple words, a watchdog is a timer that is used to detect computer malfunctions and recover from them. All three of these fields contain watchdog related data that is used by the clocksource framework. If we grep the Linux kernel source code, we will see that only the arch/x86/Kconfig kernel configuration file contains the CONFIG_CLOCKSOURCE_WATCHDOG kernel configuration option. So why do x86 and x86_64 need a watchdog? You may already know that x86 processors have a special 64-bit register, the time stamp counter. This register contains the number of cycles since reset. Sometimes the time stamp counter needs to be verified against another clock source. We will not see the initialization of the watchdog timer in this part; before this we must learn more about timers.

That's all. From this moment we know all the fields of the clocksource structure. This knowledge will help us to learn the insides of the clocksource framework.

New clock source registration

We saw only one function from the clocksource framework in the previous part: __clocksource_register. This function is defined in the include/linux/clocksource.h header file and, as we can understand from its name, its main purpose is to register a new clocksource. If we look at the implementation of the __clocksource_register function, we will see that it just calls the __clocksource_register_scale function and returns its result:

static inline int __clocksource_register(struct clocksource *cs)
{
    return __clocksource_register_scale(cs, 1, 0);
}

Before we look at the implementation of the __clocksource_register_scale function, we can see that clocksource provides an additional API for new clock source registration:

static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
{
        return __clocksource_register_scale(cs, 1, hz);
}

static inline int clocksource_register_khz(struct clocksource *cs, u32 khz)
{
        return __clocksource_register_scale(cs, 1000, khz);
}

All of these functions do the same thing. They return the value of the __clocksource_register_scale function, but with different sets of parameters. The __clocksource_register_scale function is defined in the kernel/time/clocksource.c source code file. To understand the difference between these functions, let's look at the parameters of the __clocksource_register_scale function. As we can see, this function takes three parameters:

cs - clocksource to be installed;
scale - scale factor of a clock source. In other words, if we multiply the value of this parameter by freq, we will get the frequency of the clocksource in hertz;
freq - clock source frequency divided by scale.

Now let's look at the implementation of the __clocksource_register_scale function:

int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
{
        __clocksource_update_freq_scale(cs, scale, freq);
        mutex_lock(&clocksource_mutex);
        clocksource_enqueue(cs);
        clocksource_enqueue_watchdog(cs);
        clocksource_select();
        mutex_unlock(&clocksource_mutex);
        return 0;
}

First of all, we can see that the __clocksource_register_scale function starts with a call to the __clocksource_update_freq_scale function, which is defined in the same source code file and updates the given clock source with the new frequency. Let's look at the implementation of this function. In the first step we need to check the given frequency, and if it is not zero, we need to calculate the mult and shift parameters for the given clock source. Why do we need to check the value of the frequency? Because it can actually be zero. If you looked attentively at the implementation of the __clocksource_register function, you may have noticed that we passed the frequency as 0. We do this only for clock sources that define their own mult and shift parameters. Look at the previous part and you will see the calculation of mult and shift for jiffies. The __clocksource_update_freq_scale function will do it for us for other clock sources.

So at the start of the __clocksource_update_freq_scale function we check the value of the frequency parameter, and if it is not zero we calculate mult and shift for the given clock source. Let's look at the mult and shift calculation:

void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq)
{
        u64 sec;

        if (freq) {
             sec = cs->mask;
             do_div(sec, freq);
             do_div(sec, scale);

             if (!sec)
                   sec = 1;
             else if (sec > 600 && cs->mask > UINT_MAX)
                   sec = 600;

             clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
                                    NSEC_PER_SEC / scale, sec * scale);
        }
        ...
        ...
        ...
}

Here we can see the calculation of the maximum number of seconds that we can run before the clock source counter overflows. First of all, we fill the sec variable with the value of the clock source mask. Remember that a clock source's mask is the bitmask of the bits that are valid for the given clock source, so it is also the maximum value its counter can hold. After this, we can see two division operations. First we divide the sec variable by the clock source frequency and then by the scale factor. The freq parameter, multiplied by scale, tells us how many times the counter ticks in one second. So we divide the mask value, which is the maximum value of a counter (for example jiffies), by the frequency of the counter and get the maximum number of seconds for the given clock source. The second division accounts for the scale factor, which can be 1 hertz or 1 kilohertz (10^3 Hz).

After we have got the maximum number of seconds, we check this value in the next step: a result of zero is raised to 1, and a result greater than 600 is limited to 600 when the mask is wider than 32 bits. This value serves as the maximum sleeping time for a clocksource in seconds. In the next step we can see the call of clocks_calc_mult_shift. The main purpose of this function is to calculate the mult and shift values for a given clock source. At the end of the __clocksource_update_freq_scale function we check that the just calculated mult value of the given clock source will not cause overflow after adjustment, update the max_idle_ns and max_cycles values of the clock source with the maximum number of nanoseconds that can be converted from the clock source counter, and print the result to the kernel buffer:

pr_info("%s: mask: 0x%llx max_cycles: 0x%llx, max_idle_ns: %lld ns\n",
    cs->name, cs->mask, cs->max_cycles, cs->max_idle_ns);

that we can see in the dmesg output:

$ dmesg | grep "clocksource:"
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[    0.094084] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.205302] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.452979] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x7350b459580, max_idle_ns: 881591204237 ns

After the __clocksource_update_freq_scale function finishes its work, we return to the __clocksource_register_scale function, which will register the new clock source. We can see the calls of the following three functions:

mutex_lock(&clocksource_mutex);
clocksource_enqueue(cs);
clocksource_enqueue_watchdog(cs);
clocksource_select();
mutex_unlock(&clocksource_mutex);

Note that before the first one is called, we lock the clocksource_mutex mutex. The point of the clocksource_mutex mutex is to protect the curr_clocksource variable, which represents the currently selected clocksource, and the clocksource_list variable, which represents the list of registered clocksources. Now, let's look at these three functions.

The first function, clocksource_enqueue, and the other two are defined in the same source code file. We go through all already registered clocksources, or in other words through all elements of the clocksource_list, and try to find the best place for the given clocksource:

static void clocksource_enqueue(struct clocksource *cs)
{
    struct list_head *entry = &clocksource_list;
    struct clocksource *tmp;

    list_for_each_entry(tmp, &clocksource_list, list)
        if (tmp->rating >= cs->rating)
            entry = &tmp->list;
    list_add(&cs->list, entry);
}

In the end we just insert the new clocksource into the clocksource_list. The second function, clocksource_enqueue_watchdog, does almost the same as the previous function, but it inserts the new clock source into the wd_list depending on the flags of the clock source and starts a new watchdog timer. As I already wrote, we will not consider watchdog related stuff in this part but will do it in the next parts.

The last function is clocksource_select. As we can understand from the function's name, its main purpose is to select the best clocksource from the registered clocksources. This function consists only of a call to a helper function:

static void clocksource_select(void)
{
    return __clocksource_select(false);
}

Note that the __clocksource_select function takes one parameter (false in our case). This bool parameter tells the function whether to skip the currently selected clock source while it traverses the clocksource_list. In our case we pass false, which means that we will go through all entries of the clocksource_list. We already know that the clocksource with the best rating will be first in the clocksource_list after the call of the clocksource_enqueue function, so we can easily get it from this list. After we have found the clock source with the best rating, we switch to it:

if (curr_clocksource != best && !timekeeping_notify(best)) {
    pr_info("Switched to clocksource %s\n", best->name);
    curr_clocksource = best;
}

We can see the result of this operation in the dmesg output:

$ dmesg | grep Switched
[    0.199688] clocksource: Switched to clocksource hpet
[    2.452966] clocksource: Switched to clocksource tsc

Note that we can see two clock sources in the dmesg output (hpet and tsc in our case). Yes, there can actually be many different clock sources on particular hardware. So the Linux kernel knows about all registered clock sources and switches to a clock source with a better rating each time a new clock source is registered.

If we look at the bottom of the kernel/time/clocksource.c source code file, we will see that it has a sysfs interface. The main initialization occurs in the init_clocksource_sysfs function, which is called during device initcalls. Let's look at the implementation of the init_clocksource_sysfs function:

static struct bus_type clocksource_subsys = {
    .name = "clocksource",
    .dev_name = "clocksource",
};

static int __init init_clocksource_sysfs(void)
{
    int error = subsys_system_register(&clocksource_subsys, NULL);

    if (!error)
        error = device_register(&device_clocksource);
    if (!error)
        error = device_create_file(
                &device_clocksource,
                &dev_attr_current_clocksource);
    if (!error)
        error = device_create_file(&device_clocksource,
                       &dev_attr_unbind_clocksource);
    if (!error)
        error = device_create_file(
                &device_clocksource,
                &dev_attr_available_clocksource);
    return error;
}
device_initcall(init_clocksource_sysfs);

First of all, we can see that it registers the clocksource subsystem with a call to the subsys_system_register function. In other words, after the call of this function, we will have the following directory:

$ pwd
/sys/devices/system/clocksource

After this step, we can see registration of the device_clocksource device which is represented by the following structure:

static struct device device_clocksource = {
    .id = 0,
    .bus    = &clocksource_subsys,
};

and creation of three files:

dev_attr_current_clocksource;
dev_attr_unbind_clocksource;
dev_attr_available_clocksource.

These files provide information about the current clock source in the system, the available clock sources in the system, and an interface that allows us to unbind a clock source.

After the init_clocksource_sysfs function has been executed, we will be able to find information about the available clock sources in:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
tsc hpet acpi_pm

Or for example information about current clock source in the system:

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
tsc

In the previous part, we saw the API for the registration of the jiffies clock source, but didn't dive into details about the clocksource framework. In this part we did, and saw the implementation of new clock source registration and the selection of the clock source with the best rating value in the system. Of course, this is not the whole API that the clocksource framework provides. There are a couple of additional functions, like clocksource_unregister for removing a given clock source from the clocksource_list, and so on. But I will not describe these functions in this part, because they are not important for us right now. Anyway, if you are interested in them, you can find them in kernel/time/clocksource.c.

That's all.

Conclusion

This is the end of the second part of the chapter that describes timers and time management in the Linux kernel. In the previous part we got acquainted with the following two concepts: jiffies and clocksource. In this part we saw some examples of jiffies usage and learned more details about the clocksource concept.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.


Timers and time management in the Linux kernel. Part 2.
