diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 9a53c92..a70383d 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -3528,6 +3528,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted. sched_debug [KNL] Enables verbose scheduler debug messages. + schedstats= [KNL,X86] Enable or disable scheduled statistics. + Allowed values are enable and disable. This feature + incurs a small amount of overhead in the scheduler + but is useful for debugging and performance tuning. + skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate xtime_lock contention on larger systems, and/or RCU lock contention on all systems with CONFIG_MAXSMP set. @@ -4054,6 +4059,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted. HIGHMEM regardless of setting of CONFIG_HIGHPTE. + uuid_debug= (Boolean) whether to enable debugging of TuxOnIce's + uuid support. + vdso= [X86,SH] On X86_32, this is an alias for vdso32=. Otherwise: diff --git b/Documentation/power/tuxonice-internals.txt b/Documentation/power/tuxonice-internals.txt new file mode 100644 index 0000000..0c6a216 --- /dev/null +++ b/Documentation/power/tuxonice-internals.txt @@ -0,0 +1,532 @@ + TuxOnIce 4.0 Internal Documentation. + Updated to 23 March 2015 + +(Please note that incremental image support mentioned in this document is work +in progress. This document may need updating prior to the actual release of +4.0!) + +1. Introduction. + + TuxOnIce 4.0 is an addition to the Linux Kernel, designed to + allow the user to quickly shutdown and quickly boot a computer, without + needing to close documents or programs. It is equivalent to the + hibernate facility in some laptops. This implementation, however, + requires no special BIOS or hardware support. + + The code in these files is based upon the original implementation + prepared by Gabor Kuti and additional work by Pavel Machek and a + host of others. This code has been substantially reworked by Nigel + Cunningham, again with the help and testing of many others, not the + least of whom are Bernard Blackham and Michael Frank. At its heart, + however, the operation is essentially the same as Gabor's version. + +2. Overview of operation. + + The basic sequence of operations is as follows: + + a. Quiesce all other activity. + b. Ensure enough memory and storage space are available, and attempt + to free memory/storage if necessary. + c. Allocate the required memory and storage space. + d. Write the image. + e. Power down. + + There are a number of complicating factors which mean that things are + not as simple as the above would imply, however... + + o The activity of each process must be stopped at a point where it will + not be holding locks necessary for saving the image, or unexpectedly + restart operations due to something like a timeout and thereby make + our image inconsistent. + + o It is desirous that we sync outstanding I/O to disk before calculating + image statistics. This reduces corruption if one should suspend but + then not resume, and also makes later parts of the operation safer (see + below). + + o We need to get as close as we can to an atomic copy of the data. + Inconsistencies in the image will result in inconsistent memory contents at + resume time, and thus in instability of the system and/or file system + corruption. This would appear to imply a maximum image size of one half of + the amount of RAM, but we have a solution... (again, below). + + o In 2.6 and later, we choose to play nicely with the other suspend-to-disk + implementations. + +3. Detailed description of internals. + + a. Quiescing activity. + + Safely quiescing the system is achieved using three separate but related + aspects. + + First, we use the vanilla kerne's support for freezing processes. This code + is based on the observation that the vast majority of processes don't need + to run during suspend. They can be 'frozen'. The kernel therefore + implements a refrigerator routine, which processes enter and in which they + remain until the cycle is complete. Processes enter the refrigerator via + try_to_freeze() invocations at appropriate places. A process cannot be + frozen in any old place. It must not be holding locks that will be needed + for writing the image or freezing other processes. For this reason, + userspace processes generally enter the refrigerator via the signal + handling code, and kernel threads at the place in their event loops where + they drop locks and yield to other processes or sleep. The task of freezing + processes is complicated by the fact that there can be interdependencies + between processes. Freezing process A before process B may mean that + process B cannot be frozen, because it stops at waiting for process A + rather than in the refrigerator. This issue is seen where userspace waits + on freezeable kernel threads or fuse filesystem threads. To address this + issue, we implement the following algorithm for quiescing activity: + + - Freeze filesystems (including fuse - userspace programs starting + new requests are immediately frozen; programs already running + requests complete their work before being frozen in the next + step) + - Freeze userspace + - Thaw filesystems (this is safe now that userspace is frozen and no + fuse requests are outstanding). + - Invoke sys_sync (noop on fuse). + - Freeze filesystems + - Freeze kernel threads + + If we need to free memory, we thaw kernel threads and filesystems, but not + userspace. We can then free caches without worrying about deadlocks due to + swap files being on frozen filesystems or such like. + + b. Ensure enough memory & storage are available. + + We have a number of constraints to meet in order to be able to successfully + suspend and resume. + + First, the image will be written in two parts, described below. One of + these parts needs to have an atomic copy made, which of course implies a + maximum size of one half of the amount of system memory. The other part + ('pageset') is not atomically copied, and can therefore be as large or + small as desired. + + Second, we have constraints on the amount of storage available. In these + calculations, we may also consider any compression that will be done. The + cryptoapi module allows the user to configure an expected compression ratio. + + Third, the user can specify an arbitrary limit on the image size, in + megabytes. This limit is treated as a soft limit, so that we don't fail the + attempt to suspend if we cannot meet this constraint. + + c. Allocate the required memory and storage space. + + Having done the initial freeze, we determine whether the above constraints + are met, and seek to allocate the metadata for the image. If the constraints + are not met, or we fail to allocate the required space for the metadata, we + seek to free the amount of memory that we calculate is needed and try again. + We allow up to four iterations of this loop before aborting the cycle. If + we do fail, it should only be because of a bug in TuxOnIce's calculations + or the vanilla kernel code for freeing memory. + + These steps are merged together in the prepare_image function, found in + prepare_image.c. The functions are merged because of the cyclical nature + of the problem of calculating how much memory and storage is needed. Since + the data structures containing the information about the image must + themselves take memory and use storage, the amount of memory and storage + required changes as we prepare the image. Since the changes are not large, + only one or two iterations will be required to achieve a solution. + + The recursive nature of the algorithm is miminised by keeping user space + frozen while preparing the image, and by the fact that our records of which + pages are to be saved and which pageset they are saved in use bitmaps (so + that changes in number or fragmentation of the pages to be saved don't + feedback via changes in the amount of memory needed for metadata). The + recursiveness is thus limited to any extra slab pages allocated to store the + extents that record storage used, and the effects of seeking to free memory. + + d. Write the image. + + We previously mentioned the need to create an atomic copy of the data, and + the half-of-memory limitation that is implied in this. This limitation is + circumvented by dividing the memory to be saved into two parts, called + pagesets. + + Pageset2 contains most of the page cache - the pages on the active and + inactive LRU lists that aren't needed or modified while TuxOnIce is + running, so they can be safely written without an atomic copy. They are + therefore saved first and reloaded last. While saving these pages, + TuxOnIce carefully ensures that the work of writing the pages doesn't make + the image inconsistent. With the support for Kernel (Video) Mode Setting + going into the kernel at the time of writing, we need to check for pages + on the LRU that are used by KMS, and exclude them from pageset2. They are + atomically copied as part of pageset 1. + + Once pageset2 has been saved, we prepare to do the atomic copy of remaining + memory. As part of the preparation, we power down drivers, thereby providing + them with the opportunity to have their state recorded in the image. The + amount of memory allocated by drivers for this is usually negligible, but if + DRI is in use, video drivers may require significants amounts. Ideally we + would be able to query drivers while preparing the image as to the amount of + memory they will need. Unfortunately no such mechanism exists at the time of + writing. For this reason, TuxOnIce allows the user to set an + 'extra_pages_allowance', which is used to seek to ensure sufficient memory + is available for drivers at this point. TuxOnIce also lets the user set this + value to 0. In this case, a test driver suspend is done while preparing the + image, and the difference (plus a margin) used instead. TuxOnIce will also + automatically restart the hibernation process (twice at most) if it finds + that the extra pages allowance is not sufficient. It will then use what was + actually needed (plus a margin, again). Failure to hibernate should thus + be an extremely rare occurence. + + Having suspended the drivers, we save the CPU context before making an + atomic copy of pageset1, resuming the drivers and saving the atomic copy. + After saving the two pagesets, we just need to save our metadata before + powering down. + + As we mentioned earlier, the contents of pageset2 pages aren't needed once + they've been saved. We therefore use them as the destination of our atomic + copy. In the unlikely event that pageset1 is larger, extra pages are + allocated while the image is being prepared. This is normally only a real + possibility when the system has just been booted and the page cache is + small. + + This is where we need to be careful about syncing, however. Pageset2 will + probably contain filesystem meta data. If this is overwritten with pageset1 + and then a sync occurs, the filesystem will be corrupted - at least until + resume time and another sync of the restored data. Since there is a + possibility that the user might not resume or (may it never be!) that + TuxOnIce might oops, we do our utmost to avoid syncing filesystems after + copying pageset1. + + e. Incremental images + + TuxOnIce 4.0 introduces a new incremental image mode which changes things a + little. When incremental images are enabled, we save a 'normal' image the + first time we hibernate. One resume however, we do not free the image or + the associated storage. Instead, it is retained until the next attempt at + hibernating and a mechanism is enabled which is used to track which pages + of memory are modified between the two cycles. The modified pages can then + be added to the existing image, rather than unmodified pages being saved + again unnecessarily. + + Incremental image support is available in 64 bit Linux only, due to the + requirement for extra page flags. + + This support is accomplished in the following way: + + 1) Tracking of pages. + + The tracking of changed pages is accomplished using the page fault + mechanism. When we reach a point at which we want to start tracking + changes, most pages are marked read-only and also flagged as being + read-only because of this support. Since this cannot happen for every page + of RAM, some are marked as untracked and always treated as modified whn + preparing an incremental iamge. When a process attempts to modify a page + that is marked read-only in this way, a page fault occurs, with TuxOnIce + code marking the page writable and dirty before allowing the write to + continue. In this way, the effect of incremental images on performance is + minimised - a page only causes a fault once. Small modifications to the + page allocator further reduce the number of faults that occur - free pages + are not tracked; they are made writable and marked as dirty as part of + being allocated. + + 2) Saving the incremental image / atomicity. + + The page fault mechanism is also used to improve the means by which + atomicity of the image is acheived. When it is time to do an atomic copy, + the flags for pages are reset, with the result being that it is no longer + necessary for us to do an atomic of pageset1. Instead, we normally write + the uncopied pages to disk. When an attempt is made to modify a page that + has not yet been saved, the page-fault mechanism makes a copy of the page + prior to allowing the write. This copy is then written to disk. Likewise, + on resume, if a process attempts to write to a page that has been read + while the rest of the image is still being loaded, a copy of that page is + made prior to the write being allowed. At the end of loading the image, + modified pages can thus be restored to their 'atomic copy' contents prior + to restarting normal operation. We also mark pages that are yet to be read + as invalid PFNs, so that we can capture as a bug any attempt by a + half-restored kernel to access a page that hasn't yet been reloaded. + + f. Power down. + + Powering down uses standard kernel routines. TuxOnIce supports powering down + using the ACPI S3, S4 and S5 methods or the kernel's non-ACPI power-off. + Supporting suspend to ram (S3) as a power off option might sound strange, + but it allows the user to quickly get their system up and running again if + the battery doesn't run out (we just need to re-read the overwritten pages) + and if the battery does run out (or the user removes power), they can still + resume. + +4. Data Structures. + + TuxOnIce uses three main structures to store its metadata and configuration + information: + + a) Pageflags bitmaps. + + TuxOnIce records which pages will be in pageset1, pageset2, the destination + of the atomic copy and the source of the atomically restored image using + bitmaps. The code used is that written for swsusp, with small improvements + to match TuxOnIce's requirements. + + The pageset1 bitmap is thus easily stored in the image header for use at + resume time. + + As mentioned above, using bitmaps also means that the amount of memory and + storage required for recording the above information is constant. This + greatly simplifies the work of preparing the image. In earlier versions of + TuxOnIce, extents were used to record which pages would be stored. In that + case, however, eating memory could result in greater fragmentation of the + lists of pages, which in turn required more memory to store the extents and + more storage in the image header. These could in turn require further + freeing of memory, and another iteration. All of this complexity is removed + by having bitmaps. + + Bitmaps also make a lot of sense because TuxOnIce only ever iterates + through the lists. There is therefore no cost to not being able to find the + nth page in order 0 time. We only need to worry about the cost of finding + the n+1th page, given the location of the nth page. Bitwise optimisations + help here. + + b) Extents for block data. + + TuxOnIce supports writing the image to multiple block devices. In the case + of swap, multiple partitions and/or files may be in use, and we happily use + them all (with the exception of compcache pages, which we allocate but do + not use). This use of multiple block devices is accomplished as follows: + + Whatever the actual source of the allocated storage, the destination of the + image can be viewed in terms of one or more block devices, and on each + device, a list of sectors. To simplify matters, we only use contiguous, + PAGE_SIZE aligned sectors, like the swap code does. + + Since sector numbers on each bdev may well not start at 0, it makes much + more sense to use extents here. Contiguous ranges of pages can thus be + represented in the extents by contiguous values. + + Variations in block size are taken account of in transforming this data + into the parameters for bio submission. + + We can thus implement a layer of abstraction wherein the core of TuxOnIce + doesn't have to worry about which device we're currently writing to or + where in the device we are. It simply requests that the next page in the + pageset or header be written, leaving the details to this lower layer. + The lower layer remembers where in the sequence of devices and blocks each + pageset starts. The header always starts at the beginning of the allocated + storage. + + So extents are: + + struct extent { + unsigned long minimum, maximum; + struct extent *next; + } + + These are combined into chains of extents for a device: + + struct extent_chain { + int size; /* size of the extent ie sum (max-min+1) */ + int allocs, frees; + char *name; + struct extent *first, *last_touched; + }; + + For each bdev, we need to store a little more info (simplified definition): + + struct toi_bdev_info { + struct block_device *bdev; + + char uuid[17]; + dev_t dev_t; + int bmap_shift; + int blocks_per_page; + }; + + The uuid is the main means used to identify the device in the storage + image. This means we can cope with the dev_t representation of a device + changing between saving the image and restoring it, as may happen on some + bioses or in the LVM case. + + bmap_shift and blocks_per_page apply the effects of variations in blocks + per page settings for the filesystem and underlying bdev. For most + filesystems, these are the same, but for xfs, they can have independant + values. + + Combining these two structures together, we have everything we need to + record what devices and what blocks on each device are being used to + store the image, and to submit i/o using bio_submit. + + The last elements in the picture are a means of recording how the storage + is being used. + + We do this first and foremost by implementing a layer of abstraction on + top of the devices and extent chains which allows us to view however many + devices there might be as one long storage tape, with a single 'head' that + tracks a 'current position' on the tape: + + struct extent_iterate_state { + struct extent_chain *chains; + int num_chains; + int current_chain; + struct extent *current_extent; + unsigned long current_offset; + }; + + That is, *chains points to an array of size num_chains of extent chains. + For the filewriter, this is always a single chain. For the swapwriter, the + array is of size MAX_SWAPFILES. + + current_chain, current_extent and current_offset thus point to the current + index in the chains array (and into a matching array of struct + suspend_bdev_info), the current extent in that chain (to optimise access), + and the current value in the offset. + + The image is divided into three parts: + - The header + - Pageset 1 + - Pageset 2 + + The header always starts at the first device and first block. We know its + size before we begin to save the image because we carefully account for + everything that will be stored in it. + + The second pageset (LRU) is stored first. It begins on the next page after + the end of the header. + + The first pageset is stored second. It's start location is only known once + pageset2 has been saved, since pageset2 may be compressed as it is written. + This location is thus recorded at the end of saving pageset2. It is page + aligned also. + + Since this information is needed at resume time, and the location of extents + in memory will differ at resume time, this needs to be stored in a portable + way: + + struct extent_iterate_saved_state { + int chain_num; + int extent_num; + unsigned long offset; + }; + + We can thus implement a layer of abstraction wherein the core of TuxOnIce + doesn't have to worry about which device we're currently writing to or + where in the device we are. It simply requests that the next page in the + pageset or header be written, leaving the details to this layer, and + invokes the routines to remember and restore the position, without having + to worry about the details of how the data is arranged on disk or such like. + + c) Modules + + One aim in designing TuxOnIce was to make it flexible. We wanted to allow + for the implementation of different methods of transforming a page to be + written to disk and different methods of getting the pages stored. + + In early versions (the betas and perhaps Suspend1), compression support was + inlined in the image writing code, and the data structures and code for + managing swap were intertwined with the rest of the code. A number of people + had expressed interest in implementing image encryption, and alternative + methods of storing the image. + + In order to achieve this, TuxOnIce was given a modular design. + + A module is a single file which encapsulates the functionality needed + to transform a pageset of data (encryption or compression, for example), + or to write the pageset to a device. The former type of module is called + a 'page-transformer', the later a 'writer'. + + Modules are linked together in pipeline fashion. There may be zero or more + page transformers in a pipeline, and there is always exactly one writer. + The pipeline follows this pattern: + + --------------------------------- + | TuxOnIce Core | + --------------------------------- + | + | + --------------------------------- + | Page transformer 1 | + --------------------------------- + | + | + --------------------------------- + | Page transformer 2 | + --------------------------------- + | + | + --------------------------------- + | Writer | + --------------------------------- + + During the writing of an image, the core code feeds pages one at a time + to the first module. This module performs whatever transformations it + implements on the incoming data, completely consuming the incoming data and + feeding output in a similar manner to the next module. + + All routines are SMP safe, and the final result of the transformations is + written with an index (provided by the core) and size of the output by the + writer. As a result, we can have multithreaded I/O without needing to + worry about the sequence in which pages are written (or read). + + During reading, the pipeline works in the reverse direction. The core code + calls the first module with the address of a buffer which should be filled. + (Note that the buffer size is always PAGE_SIZE at this time). This module + will in turn request data from the next module and so on down until the + writer is made to read from the stored image. + + Part of definition of the structure of a module thus looks like this: + + int (*rw_init) (int rw, int stream_number); + int (*rw_cleanup) (int rw); + int (*write_chunk) (struct page *buffer_page); + int (*read_chunk) (struct page *buffer_page, int sync); + + It should be noted that the _cleanup routine may be called before the + full stream of data has been read or written. While writing the image, + the user may (depending upon settings) choose to abort suspending, and + if we are in the midst of writing the last portion of the image, a portion + of the second pageset may be reread. This may also happen if an error + occurs and we seek to abort the process of writing the image. + + The modular design is also useful in a number of other ways. It provides + a means where by we can add support for: + + - providing overall initialisation and cleanup routines; + - serialising configuration information in the image header; + - providing debugging information to the user; + - determining memory and image storage requirements; + - dis/enabling components at run-time; + - configuring the module (see below); + + ...and routines for writers specific to their work: + - Parsing a resume= location; + - Determining whether an image exists; + - Marking a resume as having been attempted; + - Invalidating an image; + + Since some parts of the core - the user interface and storage manager + support - have use for some of these functions, they are registered as + 'miscellaneous' modules as well. + + d) Sysfs data structures. + + This brings us naturally to support for configuring TuxOnIce. We desired to + provide a way to make TuxOnIce as flexible and configurable as possible. + The user shouldn't have to reboot just because they want to now hibernate to + a file instead of a partition, for example. + + To accomplish this, TuxOnIce implements a very generic means whereby the + core and modules can register new sysfs entries. All TuxOnIce entries use + a single _store and _show routine, both of which are found in + tuxonice_sysfs.c in the kernel/power directory. These routines handle the + most common operations - getting and setting the values of bits, integers, + longs, unsigned longs and strings in one place, and allow overrides for + customised get and set options as well as side-effect routines for all + reads and writes. + + When combined with some simple macros, a new sysfs entry can then be defined + in just a couple of lines: + + SYSFS_INT("progress_granularity", SYSFS_RW, &progress_granularity, 1, + 2048, 0, NULL), + + This defines a sysfs entry named "progress_granularity" which is rw and + allows the user to access an integer stored at &progress_granularity, giving + it a value between 1 and 2048 inclusive. + + Sysfs entries are registered under /sys/power/tuxonice, and entries for + modules are located in a subdirectory named after the module. + diff --git b/Documentation/power/tuxonice.txt b/Documentation/power/tuxonice.txt new file mode 100644 index 0000000..3bf0575 --- /dev/null +++ b/Documentation/power/tuxonice.txt @@ -0,0 +1,948 @@ + --- TuxOnIce, version 3.0 --- + +1. What is it? +2. Why would you want it? +3. What do you need to use it? +4. Why not just use the version already in the kernel? +5. How do you use it? +6. What do all those entries in /sys/power/tuxonice do? +7. How do you get support? +8. I think I've found a bug. What should I do? +9. When will XXX be supported? +10 How does it work? +11. Who wrote TuxOnIce? + +1. What is it? + + Imagine you're sitting at your computer, working away. For some reason, you + need to turn off your computer for a while - perhaps it's time to go home + for the day. When you come back to your computer next, you're going to want + to carry on where you left off. Now imagine that you could push a button and + have your computer store the contents of its memory to disk and power down. + Then, when you next start up your computer, it loads that image back into + memory and you can carry on from where you were, just as if you'd never + turned the computer off. You have far less time to start up, no reopening of + applications or finding what directory you put that file in yesterday. + That's what TuxOnIce does. + + TuxOnIce has a long heritage. It began life as work by Gabor Kuti, who, + with some help from Pavel Machek, got an early version going in 1999. The + project was then taken over by Florent Chabaud while still in alpha version + numbers. Nigel Cunningham came on the scene when Florent was unable to + continue, moving the project into betas, then 1.0, 2.0 and so on up to + the present series. During the 2.0 series, the name was contracted to + Suspend2 and the website suspend2.net created. Beginning around July 2007, + a transition to calling the software TuxOnIce was made, to seek to help + make it clear that TuxOnIce is more concerned with hibernation than suspend + to ram. + + Pavel Machek's swsusp code, which was merged around 2.5.17 retains the + original name, and was essentially a fork of the beta code until Rafael + Wysocki came on the scene in 2005 and began to improve it further. + +2. Why would you want it? + + Why wouldn't you want it? + + Being able to save the state of your system and quickly restore it improves + your productivity - you get a useful system in far less time than through + the normal boot process. You also get to be completely 'green', using zero + power, or as close to that as possible (the computer may still provide + minimal power to some devices, so they can initiate a power on, but that + will be the same amount of power as would be used if you told the computer + to shutdown. + +3. What do you need to use it? + + a. Kernel Support. + + i) The TuxOnIce patch. + + TuxOnIce is part of the Linux Kernel. This version is not part of Linus's + 2.6 tree at the moment, so you will need to download the kernel source and + apply the latest patch. Having done that, enable the appropriate options in + make [menu|x]config (under Power Management Options - look for "Enhanced + Hibernation"), compile and install your kernel. TuxOnIce works with SMP, + Highmem, preemption, fuse filesystems, x86-32, PPC and x86_64. + + TuxOnIce patches are available from http://tuxonice.net. + + ii) Compression support. + + Compression support is implemented via the cryptoapi. You will therefore want + to select any Cryptoapi transforms that you want to use on your image from + the Cryptoapi menu while configuring your kernel. We recommend the use of the + LZO compression method - it is very fast and still achieves good compression. + + You can also tell TuxOnIce to write its image to an encrypted and/or + compressed filesystem/swap partition. In that case, you don't need to do + anything special for TuxOnIce when it comes to kernel configuration. + + iii) Configuring other options. + + While you're configuring your kernel, try to configure as much as possible + to build as modules. We recommend this because there are a number of drivers + that are still in the process of implementing proper power management + support. In those cases, the best way to work around their current lack is + to build them as modules and remove the modules while hibernating. You might + also bug the driver authors to get their support up to speed, or even help! + + b. Storage. + + i) Swap. + + TuxOnIce can store the hibernation image in your swap partition, a swap file or + a combination thereof. Whichever combination you choose, you will probably + want to create enough swap space to store the largest image you could have, + plus the space you'd normally use for swap. A good rule of thumb would be + to calculate the amount of swap you'd want without using TuxOnIce, and then + add the amount of memory you have. This swapspace can be arranged in any way + you'd like. It can be in one partition or file, or spread over a number. The + only requirement is that they be active when you start a hibernation cycle. + + There is one exception to this requirement. TuxOnIce has the ability to turn + on one swap file or partition at the start of hibernating and turn it back off + at the end. If you want to ensure you have enough memory to store a image + when your memory is fully used, you might want to make one swap partition or + file for 'normal' use, and another for TuxOnIce to activate & deactivate + automatically. (Further details below). + + ii) Normal files. + + TuxOnIce includes a 'file allocator'. The file allocator can store your + image in a simple file. Since Linux has the concept of everything being a + file, this is more powerful than it initially sounds. If, for example, you + were to set up a network block device file, you could hibernate to a network + server. This has been tested and works to a point, but nbd itself isn't + stateless enough for our purposes. + + Take extra care when setting up the file allocator. If you just type + commands without thinking and then try to hibernate, you could cause + irreversible corruption on your filesystems! Make sure you have backups. + + Most people will only want to hibernate to a local file. To achieve that, do + something along the lines of: + + echo "TuxOnIce" > /hibernation-file + dd if=/dev/zero bs=1M count=512 >> /hibernation-file + + This will create a 512MB file called /hibernation-file. To get TuxOnIce to use + it: + + echo /hibernation-file > /sys/power/tuxonice/file/target + + Then + + cat /sys/power/tuxonice/resume + + Put the results of this into your bootloader's configuration (see also step + C, below): + + ---EXAMPLE-ONLY-DON'T-COPY-AND-PASTE--- + # cat /sys/power/tuxonice/resume + file:/dev/hda2:0x1e001 + + In this example, we would edit the append= line of our lilo.conf|menu.lst + so that it included: + + resume=file:/dev/hda2:0x1e001 + ---EXAMPLE-ONLY-DON'T-COPY-AND-PASTE--- + + For those who are thinking 'Could I make the file sparse?', the answer is + 'No!'. At the moment, there is no way for TuxOnIce to fill in the holes in + a sparse file while hibernating. In the longer term (post merge!), I'd like + to change things so that the file could be dynamically resized and have + holes filled as needed. Right now, however, that's not possible and not a + priority. + + c. Bootloader configuration. + + Using TuxOnIce also requires that you add an extra parameter to + your lilo.conf or equivalent. Here's an example for a swap partition: + + append="resume=swap:/dev/hda1" + + This would tell TuxOnIce that /dev/hda1 is a swap partition you + have. TuxOnIce will use the swap signature of this partition as a + pointer to your data when you hibernate. This means that (in this example) + /dev/hda1 doesn't need to be _the_ swap partition where all of your data + is actually stored. It just needs to be a swap partition that has a + valid signature. + + You don't need to have a swap partition for this purpose. TuxOnIce + can also use a swap file, but usage is a little more complex. Having made + your swap file, turn it on and do + + cat /sys/power/tuxonice/swap/headerlocations + + (this assumes you've already compiled your kernel with TuxOnIce + support and booted it). The results of the cat command will tell you + what you need to put in lilo.conf: + + For swap partitions like /dev/hda1, simply use resume=/dev/hda1. + For swapfile `swapfile`, use resume=swap:/dev/hda2:0x242d. + + If the swapfile changes for any reason (it is moved to a different + location, it is deleted and recreated, or the filesystem is + defragmented) then you will have to check + /sys/power/tuxonice/swap/headerlocations for a new resume_block value. + + Once you've compiled and installed the kernel and adjusted your bootloader + configuration, you should only need to reboot for the most basic part + of TuxOnIce to be ready. + + If you only compile in the swap allocator, or only compile in the file + allocator, you don't need to add the "swap:" part of the resume= + parameters above. resume=/dev/hda2:0x242d will work just as well. If you + have compiled both and your storage is on swap, you can also use this + format (the swap allocator is the default allocator). + + When compiling your kernel, one of the options in the 'Power Management + Support' menu, just above the 'Enhanced Hibernation (TuxOnIce)' entry is + called 'Default resume partition'. This can be used to set a default value + for the resume= parameter. + + d. The hibernate script. + + Since the driver model in 2.6 kernels is still being developed, you may need + to do more than just configure TuxOnIce. Users of TuxOnIce usually start the + process via a script which prepares for the hibernation cycle, tells the + kernel to do its stuff and then restore things afterwards. This script might + involve: + + - Switching to a text console and back if X doesn't like the video card + status on resume. + - Un/reloading drivers that don't play well with hibernation. + + Note that you might not be able to unload some drivers if there are + processes using them. You might have to kill off processes that hold + devices open. Hint: if your X server accesses an USB mouse, doing a + 'chvt' to a text console releases the device and you can unload the + module. + + Check out the latest script (available on tuxonice.net). + + e. The userspace user interface. + + TuxOnIce has very limited support for displaying status if you only apply + the kernel patch - it can printk messages, but that is all. In addition, + some of the functions mentioned in this document (such as cancelling a cycle + or performing interactive debugging) are unavailable. To utilise these + functions, or simply get a nice display, you need the 'userui' component. + Userui comes in three flavours, usplash, fbsplash and text. Text should + work on any console. Usplash and fbsplash require the appropriate + (distro specific?) support. + + To utilise a userui, TuxOnIce just needs to be told where to find the + userspace binary: + + echo "/usr/local/sbin/tuxoniceui_fbsplash" > /sys/power/tuxonice/user_interface/program + + The hibernate script can do this for you, and a default value for this + setting can be configured when compiling the kernel. This path is also + stored in the image header, so if you have an initrd or initramfs, you can + use the userui during the first part of resuming (prior to the atomic + restore) by putting the binary in the same path in your initrd/ramfs. + Alternatively, you can put it in a different location and do an echo + similar to the above prior to the echo > do_resume. The value saved in the + image header will then be ignored. + +4. Why not just use the version already in the kernel? + + The version in the vanilla kernel has a number of drawbacks. The most + serious of these are: + - it has a maximum image size of 1/2 total memory; + - it doesn't allocate storage until after it has snapshotted memory. + This means that you can't be sure hibernating will work until you + see it start to write the image; + - it does not allow you to press escape to cancel a cycle; + - it does not allow you to press escape to cancel resuming; + - it does not allow you to automatically swapon a file when + starting a cycle; + - it does not allow you to use multiple swap partitions or files; + - it does not allow you to use ordinary files; + - it just invalidates an image and continues to boot if you + accidentally boot the wrong kernel after hibernating; + - it doesn't support any sort of nice display while hibernating; + - it is moving toward requiring that you have an initrd/initramfs + to ever have a hope of resuming (uswsusp). While uswsusp will + address some of the concerns above, it won't address all of them, + and will be more complicated to get set up; + - it doesn't have support for suspend-to-both (write a hibernation + image, then suspend to ram; I think this is known as ReadySafe + under M$). + +5. How do you use it? + + A hibernation cycle can be started directly by doing: + + echo > /sys/power/tuxonice/do_hibernate + + In practice, though, you'll probably want to use the hibernate script + to unload modules, configure the kernel the way you like it and so on. + In that case, you'd do (as root): + + hibernate + + See the hibernate script's man page for more details on the options it + takes. + + If you're using the text or splash user interface modules, one feature of + TuxOnIce that you might find useful is that you can press Escape at any time + during hibernating, and the process will be aborted. + + Due to the way hibernation works, this means you'll have your system back and + perfectly usable almost instantly. The only exception is when it's at the + very end of writing the image. Then it will need to reload a small (usually + 4-50MBs, depending upon the image characteristics) portion first. + + Likewise, when resuming, you can press escape and resuming will be aborted. + The computer will then powerdown again according to settings at that time for + the powerdown method or rebooting. + + You can change the settings for powering down while the image is being + written by pressing 'R' to toggle rebooting and 'O' to toggle between + suspending to ram and powering down completely). + + If you run into problems with resuming, adding the "noresume" option to + the kernel command line will let you skip the resume step and recover your + system. This option shouldn't normally be needed, because TuxOnIce modifies + the image header prior to the atomic restore, and will thus prompt you + if it detects that you've tried to resume an image before (this flag is + removed if you press Escape to cancel a resume, so you won't be prompted + then). + + Recent kernels (2.6.24 onwards) add support for resuming from a different + kernel to the one that was hibernated (thanks to Rafael for his work on + this - I've just embraced and enhanced the support for TuxOnIce). This + should further reduce the need for you to use the noresume option. + +6. What do all those entries in /sys/power/tuxonice do? + + /sys/power/tuxonice is the directory which contains files you can use to + tune and configure TuxOnIce to your liking. The exact contents of + the directory will depend upon the version of TuxOnIce you're + running and the options you selected at compile time. In the following + descriptions, names in brackets refer to compile time options. + (Note that they're all dependant upon you having selected CONFIG_TUXONICE + in the first place!). + + Since the values of these settings can open potential security risks, the + writeable ones are accessible only to the root user. You may want to + configure sudo to allow you to invoke your hibernate script as an ordinary + user. + + - alloc/failure_test + + This debugging option provides a way of testing TuxOnIce's handling of + memory allocation failures. Each allocation type that TuxOnIce makes has + been given a unique number (see the source code). Echo the appropriate + number into this entry, and when TuxOnIce attempts to do that allocation, + it will pretend there was a failure and act accordingly. + + - alloc/find_max_mem_allocated + + This debugging option will cause TuxOnIce to find the maximum amount of + memory it used during a cycle, and report that information in debugging + information at the end of the cycle. + + - alt_resume_param + + Instead of powering down after writing a hibernation image, TuxOnIce + supports resuming from a different image. This entry lets you set the + location of the signature for that image (the resume= value you'd use + for it). Using an alternate image and keep_image mode, you can do things + like using an alternate image to power down an uninterruptible power + supply. + + - block_io/target_outstanding_io + + This value controls the amount of memory that the block I/O code says it + needs when the core code is calculating how much memory is needed for + hibernating and for resuming. It doesn't directly control the amount of + I/O that is submitted at any one time - that depends on the amount of + available memory (we may have more available than we asked for), the + throughput that is being achieved and the ability of the CPU to keep up + with disk throughput (particularly where we're compressing pages). + + - checksum/enabled + + Use cryptoapi hashing routines to verify that Pageset2 pages don't change + while we're saving the first part of the image, and to get any pages that + do change resaved in the atomic copy. This should normally not be needed, + but if you're seeing issues, please enable this. If your issues stop you + being able to resume, enable this option, hibernate and cancel the cycle + after the atomic copy is done. If the debugging info shows a non-zero + number of pages resaved, please report this to Nigel. + + - compression/algorithm + + Set the cryptoapi algorithm used for compressing the image. + + - compression/expected_compression + + These values allow you to set an expected compression ratio, which TuxOnice + will use in calculating whether it meets constraints on the image size. If + this expected compression ratio is not attained, the hibernation cycle will + abort, so it is wise to allow some spare. You can see what compression + ratio is achieved in the logs after hibernating. + + - debug_info: + + This file returns information about your configuration that may be helpful + in diagnosing problems with hibernating. + + - did_suspend_to_both: + + This file can be used when you hibernate with powerdown method 3 (ie suspend + to ram after writing the image). There can be two outcomes in this case. We + can resume from the suspend-to-ram before the battery runs out, or we can run + out of juice and and up resuming like normal. This entry lets you find out, + post resume, which way we went. If the value is 1, we resumed from suspend + to ram. This can be useful when actions need to be run post suspend-to-ram + that don't need to be run if we did the normal resume from power off. + + - do_hibernate: + + When anything is written to this file, the kernel side of TuxOnIce will + begin to attempt to write an image to disk and power down. You'll normally + want to run the hibernate script instead, to get modules unloaded first. + + - do_resume: + + When anything is written to this file TuxOnIce will attempt to read and + restore an image. If there is no image, it will return almost immediately. + If an image exists, the echo > will never return. Instead, the original + kernel context will be restored and the original echo > do_hibernate will + return. + + - */enabled + + These option can be used to temporarily disable various parts of TuxOnIce. + + - extra_pages_allowance + + When TuxOnIce does its atomic copy, it calls the driver model suspend + and resume methods. If you have DRI enabled with a driver such as fglrx, + this can result in the driver allocating a substantial amount of memory + for storing its state. Extra_pages_allowance tells TuxOnIce how much + extra memory it should ensure is available for those allocations. If + your attempts at hibernating end with a message in dmesg indicating that + insufficient extra pages were allowed, you need to increase this value. + + - file/target: + + Read this value to get the current setting. Write to it to point TuxOnice + at a new storage location for the file allocator. See section 3.b.ii above + for details of how to set up the file allocator. + + - freezer_test + + This entry can be used to get TuxOnIce to just test the freezer and prepare + an image without actually doing a hibernation cycle. It is useful for + diagnosing freezing and image preparation issues. + + - full_pageset2 + + TuxOnIce divides the pages that are stored in an image into two sets. The + difference between the two sets is that pages in pageset 1 are atomically + copied, and pages in pageset 2 are written to disk without being copied + first. A page CAN be written to disk without being copied first if and only + if its contents will not be modified or used at any time after userspace + processes are frozen. A page MUST be in pageset 1 if its contents are + modified or used at any time after userspace processes have been frozen. + + Normally (ie if this option is enabled), TuxOnIce will put all pages on the + per-zone LRUs in pageset2, then remove those pages used by any userspace + user interface helper and TuxOnIce storage manager that are running, + together with pages used by the GEM memory manager introduced around 2.6.28 + kernels. + + If this option is disabled, a much more conservative approach will be taken. + The only pages in pageset2 will be those belonging to userspace processes, + with the exclusion of those belonging to the TuxOnIce userspace helpers + mentioned above. This will result in a much smaller pageset2, and will + therefore result in smaller images than are possible with this option + enabled. + + - ignore_rootfs + + TuxOnIce records which device is mounted as the root filesystem when + writing the hibernation image. It will normally check at resume time that + this device isn't already mounted - that would be a cause of filesystem + corruption. In some particular cases (RAM based root filesystems), you + might want to disable this check. This option allows you to do that. + + - image_exists: + + Can be used in a script to determine whether a valid image exists at the + location currently pointed to by resume=. Returns up to three lines. + The first is whether an image exists (-1 for unsure, otherwise 0 or 1). + If an image eixsts, additional lines will return the machine and version. + Echoing anything to this entry removes any current image. + + - image_size_limit: + + The maximum size of hibernation image written to disk, measured in megabytes + (1024*1024). + + - last_result: + + The result of the last hibernation cycle, as defined in + include/linux/suspend-debug.h with the values SUSPEND_ABORTED to + SUSPEND_KEPT_IMAGE. This is a bitmask. + + - late_cpu_hotplug: + + This sysfs entry controls whether cpu hotplugging is done - as normal - just + before (unplug) and after (replug) the atomic copy/restore (so that all + CPUs/cores are available for multithreaded I/O). The alternative is to + unplug all secondary CPUs/cores at the start of hibernating/resuming, and + replug them at the end of resuming. No multithreaded I/O will be possible in + this configuration, but the odd machine has been reported to require it. + + - lid_file: + + This determines which ACPI button file we look in to determine whether the + lid is open or closed after resuming from suspend to disk or power off. + If the entry is set to "lid/LID", we'll open /proc/acpi/button/lid/LID/state + and check its contents at the appropriate moment. See post_wake_state below + for more details on how this entry is used. + + - log_everything (CONFIG_PM_DEBUG): + + Setting this option results in all messages printed being logged. Normally, + only a subset are logged, so as to not slow the process and not clutter the + logs. Useful for debugging. It can be toggled during a cycle by pressing + 'L'. + + - no_load_direct: + + This is a debugging option. If, when loading the atomically copied pages of + an image, TuxOnIce finds that the destination address for a page is free, + it will normally allocate the image, load the data directly into that + address and skip it in the atomic restore. If this option is disabled, the + page will be loaded somewhere else and atomically restored like other pages. + + - no_flusher_thread: + + When doing multithreaded I/O (see below), the first online CPU can be used + to _just_ submit compressed pages when writing the image, rather than + compressing and submitting data. This option is normally disabled, but has + been included because Nigel would like to see whether it will be more useful + as the number of cores/cpus in computers increases. + + - no_multithreaded_io: + + TuxOnIce will normally create one thread per cpu/core on your computer, + each of which will then perform I/O. This will generally result in + throughput that's the maximum the storage medium can handle. There + shouldn't be any reason to disable multithreaded I/O now, but this option + has been retained for debugging purposes. + + - no_pageset2 + + See the entry for full_pageset2 above for an explanation of pagesets. + Enabling this option causes TuxOnIce to do an atomic copy of all pages, + thereby limiting the maximum image size to 1/2 of memory, as swsusp does. + + - no_pageset2_if_unneeded + + See the entry for full_pageset2 above for an explanation of pagesets. + Enabling this option causes TuxOnIce to act like no_pageset2 was enabled + if and only it isn't needed anyway. This option may still make TuxOnIce + less reliable because pageset2 pages are normally used to store the + atomic copy - drivers that want to do allocations of larger amounts of + memory in one shot will be more likely to find that those amounts aren't + available if this option is enabled. + + - pause_between_steps (CONFIG_PM_DEBUG): + + This option is used during debugging, to make TuxOnIce pause between + each step of the process. It is ignored when the nice display is on. + + - post_wake_state: + + TuxOnIce provides support for automatically waking after a user-selected + delay, and using a different powerdown method if the lid is still closed. + (Yes, we're assuming a laptop). This entry lets you choose what state + should be entered next. The values are those described under + powerdown_method, below. It can be used to suspend to RAM after hibernating, + then powerdown properly (say) 20 minutes. It can also be used to power down + properly, then wake at (say) 6.30am and suspend to RAM until you're ready + to use the machine. + + - powerdown_method: + + Used to select a method by which TuxOnIce should powerdown after writing the + image. Currently: + + 0: Don't use ACPI to power off. + 3: Attempt to enter Suspend-to-ram. + 4: Attempt to enter ACPI S4 mode. + 5: Attempt to power down via ACPI S5 mode. + + Note that these options are highly dependant upon your hardware & software: + + 3: When succesful, your machine suspends to ram instead of powering off. + The advantage of using this mode is that it doesn't matter whether your + battery has enough charge to make it through to your next resume. If it + lasts, you will simply resume from suspend to ram (and the image on disk + will be discarded). If the battery runs out, you will resume from disk + instead. The disadvantage is that it takes longer than a normal + suspend-to-ram to enter the state, since the suspend-to-disk image needs + to be written first. + 4/5: When successful, your machine will be off and comsume (almost) no power. + But it might still react to some external events like opening the lid or + trafic on a network or usb device. For the bios, resume is then the same + as warm boot, similar to a situation where you used the command `reboot' + to reboot your machine. If your machine has problems on warm boot or if + you want to protect your machine with the bios password, this is probably + not the right choice. Mode 4 may be necessary on some machines where ACPI + wake up methods need to be run to properly reinitialise hardware after a + hibernation cycle. + 0: Switch the machine completely off. The only possible wakeup is the power + button. For the bios, resume is then the same as a cold boot, in + particular you would have to provide your bios boot password if your + machine uses that feature for booting. + + - progressbar_granularity_limit: + + This option can be used to limit the granularity of the progress bar + displayed with a bootsplash screen. The value is the maximum number of + steps. That is, 10 will make the progress bar jump in 10% increments. + + - reboot: + + This option causes TuxOnIce to reboot rather than powering down + at the end of saving an image. It can be toggled during a cycle by pressing + 'R'. + + - resume: + + This sysfs entry can be used to read and set the location in which TuxOnIce + will look for the signature of an image - the value set using resume= at + boot time or CONFIG_PM_STD_PARTITION ("Default resume partition"). By + writing to this file as well as modifying your bootloader's configuration + file (eg menu.lst), you can set or reset the location of your image or the + method of storing the image without rebooting. + + - replace_swsusp (CONFIG_TOI_REPLACE_SWSUSP): + + This option makes + + echo disk > /sys/power/state + + activate TuxOnIce instead of swsusp. Regardless of whether this option is + enabled, any invocation of swsusp's resume time trigger will cause TuxOnIce + to check for an image too. This is due to the fact that at resume time, we + can't know whether this option was enabled until we see if an image is there + for us to resume from. (And when an image exists, we don't care whether we + did replace swsusp anyway - we just want to resume). + + - resume_commandline: + + This entry can be read after resuming to see the commandline that was used + when resuming began. You might use this to set up two bootloader entries + that are the same apart from the fact that one includes a extra append= + argument "at_work=1". You could then grep resume_commandline in your + post-resume scripts and configure networking (for example) differently + depending upon whether you're at home or work. resume_commandline can be + set to arbitrary text if you wish to remove sensitive contents. + + - swap/swapfilename: + + This entry is used to specify the swapfile or partition that + TuxOnIce will attempt to swapon/swapoff automatically. Thus, if + I normally use /dev/hda1 for swap, and want to use /dev/hda2 for specifically + for my hibernation image, I would + + echo /dev/hda2 > /sys/power/tuxonice/swap/swapfile + + /dev/hda2 would then be automatically swapon'd and swapoff'd. Note that the + swapon and swapoff occur while other processes are frozen (including kswapd) + so this swap file will not be used up when attempting to free memory. The + parition/file is also given the highest priority, so other swapfiles/partitions + will only be used to save the image when this one is filled. + + The value of this file is used by headerlocations along with any currently + activated swapfiles/partitions. + + - swap/headerlocations: + + This option tells you the resume= options to use for swap devices you + currently have activated. It is particularly useful when you only want to + use a swap file to store your image. See above for further details. + + - test_bio + + This is a debugging option. When enabled, TuxOnIce will not hibernate. + Instead, when asked to write an image, it will skip the atomic copy, + just doing the writing of the image and then returning control to the + user at the point where it would have powered off. This is useful for + testing throughput in different configurations. + + - test_filter_speed + + This is a debugging option. When enabled, TuxOnIce will not hibernate. + Instead, when asked to write an image, it will not write anything or do + an atomic copy, but will only run any enabled compression algorithm on the + data that would have been written (the source pages of the atomic copy in + the case of pageset 1). This is useful for comparing the performance of + compression algorithms and for determining the extent to which an upgrade + to your storage method would improve hibernation speed. + + - user_interface/debug_sections (CONFIG_PM_DEBUG): + + This value, together with the console log level, controls what debugging + information is displayed. The console log level determines the level of + detail, and this value determines what detail is displayed. This value is + a bit vector, and the meaning of the bits can be found in the kernel tree + in include/linux/tuxonice.h. It can be overridden using the kernel's + command line option suspend_dbg. + + - user_interface/default_console_level (CONFIG_PM_DEBUG): + + This determines the value of the console log level at the start of a + hibernation cycle. If debugging is compiled in, the console log level can be + changed during a cycle by pressing the digit keys. Meanings are: + + 0: Nice display. + 1: Nice display plus numerical progress. + 2: Errors only. + 3: Low level debugging info. + 4: Medium level debugging info. + 5: High level debugging info. + 6: Verbose debugging info. + + - user_interface/enable_escape: + + Setting this to "1" will enable you abort a hibernation cycle or resuming by + pressing escape, "0" (default) disables this feature. Note that enabling + this option means that you cannot initiate a hibernation cycle and then walk + away from your computer, expecting it to be secure. With feature disabled, + you can validly have this expectation once TuxOnice begins to write the + image to disk. (Prior to this point, it is possible that TuxOnice might + about because of failure to freeze all processes or because constraints + on its ability to save the image are not met). + + - user_interface/program + + This entry is used to tell TuxOnice what userspace program to use for + providing a user interface while hibernating. The program uses a netlink + socket to pass messages back and forward to the kernel, allowing all of the + functions formerly implemented in the kernel user interface components. + + - version: + + The version of TuxOnIce you have compiled into the currently running kernel. + + - wake_alarm_dir: + + As mentioned above (post_wake_state), TuxOnIce supports automatically waking + after some delay. This entry allows you to select which wake alarm to use. + It should contain the value "rtc0" if you're wanting to use + /sys/class/rtc/rtc0. + + - wake_delay: + + This value determines the delay from the end of writing the image until the + wake alarm is triggered. You can set an absolute time by writing the desired + time into /sys/class/rtc//wakealarm and leaving these values + empty. + + Note that for the wakeup to actually occur, you may need to modify entries + in /proc/acpi/wakeup. This is done by echoing the name of the button in the + first column (eg PBTN) into the file. + +7. How do you get support? + + Glad you asked. TuxOnIce is being actively maintained and supported + by Nigel (the guy doing most of the kernel coding at the moment), Bernard + (who maintains the hibernate script and userspace user interface components) + and its users. + + Resources availble include HowTos, FAQs and a Wiki, all available via + tuxonice.net. You can find the mailing lists there. + +8. I think I've found a bug. What should I do? + + By far and a way, the most common problems people have with TuxOnIce + related to drivers not having adequate power management support. In this + case, it is not a bug with TuxOnIce, but we can still help you. As we + mentioned above, such issues can usually be worked around by building the + functionality as modules and unloading them while hibernating. Please visit + the Wiki for up-to-date lists of known issues and work arounds. + + If this information doesn't help, try running: + + hibernate --bug-report + + ..and sending the output to the users mailing list. + + Good information on how to provide us with useful information from an + oops is found in the file REPORTING-BUGS, in the top level directory + of the kernel tree. If you get an oops, please especially note the + information about running what is printed on the screen through ksymoops. + The raw information is useless. + +9. When will XXX be supported? + + If there's a feature missing from TuxOnIce that you'd like, feel free to + ask. We try to be obliging, within reason. + + Patches are welcome. Please send to the list. + +10. How does it work? + + TuxOnIce does its work in a number of steps. + + a. Freezing system activity. + + The first main stage in hibernating is to stop all other activity. This is + achieved in stages. Processes are considered in fours groups, which we will + describe in reverse order for clarity's sake: Threads with the PF_NOFREEZE + flag, kernel threads without this flag, userspace processes with the + PF_SYNCTHREAD flag and all other processes. The first set (PF_NOFREEZE) are + untouched by the refrigerator code. They are allowed to run during hibernating + and resuming, and are used to support user interaction, storage access or the + like. Other kernel threads (those unneeded while hibernating) are frozen last. + This leaves us with userspace processes that need to be frozen. When a + process enters one of the *_sync system calls, we set a PF_SYNCTHREAD flag on + that process for the duration of that call. Processes that have this flag are + frozen after processes without it, so that we can seek to ensure that dirty + data is synced to disk as quickly as possible in a situation where other + processes may be submitting writes at the same time. Freezing the processes + that are submitting data stops new I/O from being submitted. Syncthreads can + then cleanly finish their work. So the order is: + + - Userspace processes without PF_SYNCTHREAD or PF_NOFREEZE; + - Userspace processes with PF_SYNCTHREAD (they won't have NOFREEZE); + - Kernel processes without PF_NOFREEZE. + + b. Eating memory. + + For a successful hibernation cycle, you need to have enough disk space to store the + image and enough memory for the various limitations of TuxOnIce's + algorithm. You can also specify a maximum image size. In order to attain + to those constraints, TuxOnIce may 'eat' memory. If, after freezing + processes, the constraints aren't met, TuxOnIce will thaw all the + other processes and begin to eat memory until its calculations indicate + the constraints are met. It will then freeze processes again and recheck + its calculations. + + c. Allocation of storage. + + Next, TuxOnIce allocates the storage that will be used to save + the image. + + The core of TuxOnIce knows nothing about how or where pages are stored. We + therefore request the active allocator (remember you might have compiled in + more than one!) to allocate enough storage for our expect image size. If + this request cannot be fulfilled, we eat more memory and try again. If it + is fulfiled, we seek to allocate additional storage, just in case our + expected compression ratio (if any) isn't achieved. This time, however, we + just continue if we can't allocate enough storage. + + If these calls to our allocator change the characteristics of the image + such that we haven't allocated enough memory, we also loop. (The allocator + may well need to allocate space for its storage information). + + d. Write the first part of the image. + + TuxOnIce stores the image in two sets of pages called 'pagesets'. + Pageset 2 contains pages on the active and inactive lists; essentially + the page cache. Pageset 1 contains all other pages, including the kernel. + We use two pagesets for one important reason: We need to make an atomic copy + of the kernel to ensure consistency of the image. Without a second pageset, + that would limit us to an image that was at most half the amount of memory + available. Using two pagesets allows us to store a full image. Since pageset + 2 pages won't be needed in saving pageset 1, we first save pageset 2 pages. + We can then make our atomic copy of the remaining pages using both pageset 2 + pages and any other pages that are free. While saving both pagesets, we are + careful not to corrupt the image. Among other things, we use lowlevel block + I/O routines that don't change the pagecache contents. + + The next step, then, is writing pageset 2. + + e. Suspending drivers and storing processor context. + + Having written pageset2, TuxOnIce calls the power management functions to + notify drivers of the hibernation, and saves the processor state in preparation + for the atomic copy of memory we are about to make. + + f. Atomic copy. + + At this stage, everything else but the TuxOnIce code is halted. Processes + are frozen or idling, drivers are quiesced and have stored (ideally and where + necessary) their configuration in memory we are about to atomically copy. + In our lowlevel architecture specific code, we have saved the CPU state. + We can therefore now do our atomic copy before resuming drivers etc. + + g. Save the atomic copy (pageset 1). + + TuxOnice can then write the atomic copy of the remaining pages. Since we + have copied the pages into other locations, we can continue to use the + normal block I/O routines without fear of corruption our image. + + f. Save the image header. + + Nearly there! We save our settings and other parameters needed for + reloading pageset 1 in an 'image header'. We also tell our allocator to + serialise its data at this stage, so that it can reread the image at resume + time. + + g. Set the image header. + + Finally, we edit the header at our resume= location. The signature is + changed by the allocator to reflect the fact that an image exists, and to + point to the start of that data if necessary (swap allocator). + + h. Power down. + + Or reboot if we're debugging and the appropriate option is selected. + + Whew! + + Reloading the image. + -------------------- + + Reloading the image is essentially the reverse of all the above. We load + our copy of pageset 1, being careful to choose locations that aren't going + to be overwritten as we copy it back (We start very early in the boot + process, so there are no other processes to quiesce here). We then copy + pageset 1 back to its original location in memory and restore the process + context. We are now running with the original kernel. Next, we reload the + pageset 2 pages, free the memory and swap used by TuxOnIce, restore + the pageset header and restart processes. Sounds easy in comparison to + hibernating, doesn't it! + + There is of course more to TuxOnIce than this, but this explanation + should be a good start. If there's interest, I'll write further + documentation on range pages and the low level I/O. + +11. Who wrote TuxOnIce? + + (Answer based on the writings of Florent Chabaud, credits in files and + Nigel's limited knowledge; apologies to anyone missed out!) + + The main developers of TuxOnIce have been... + + Gabor Kuti + Pavel Machek + Florent Chabaud + Bernard Blackham + Nigel Cunningham + + Significant portions of swsusp, the code in the vanilla kernel which + TuxOnIce enhances, have been worked on by Rafael Wysocki. Thanks should + also be expressed to him. + + The above mentioned developers have been aided in their efforts by a host + of hundreds, if not thousands of testers and people who have submitted bug + fixes & suggestions. Of special note are the efforts of Michael Frank, who + had his computers repetitively hibernate and resume for literally tens of + thousands of cycles and developed scripts to stress the system and test + TuxOnIce far beyond the point most of us (Nigel included!) would consider + testing. His efforts have contributed as much to TuxOnIce as any of the + names above. diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index a93b414..87119dc 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -760,6 +760,14 @@ rtsig-nr shows the number of RT signals currently queued. ============================================================== +sched_schedstats: + +Enables/disables scheduler statistics. Enabling this feature +incurs a small amount of overhead in the scheduler but is +useful for debugging and performance tuning. + +============================================================== + sg-big-buff: This file shows the size of the generic SCSI (sg) buffer. diff --git b/Documentation/tp_smapi.txt b/Documentation/tp_smapi.txt new file mode 100644 index 0000000..d037301 --- /dev/null +++ b/Documentation/tp_smapi.txt @@ -0,0 +1,267 @@ +tp_smapi version 0.40 +IBM ThinkPad hardware functions driver + +Author: Shem Multinymous +Project: http://sourceforge.net/projects/tpctl +Wiki: http://thinkwiki.org/wiki/tp_smapi +List: linux-thinkpad@linux-thinkpad.org + (http://mailman.linux-thinkpad.org/mailman/listinfo/linux-thinkpad) + +Description +----------- + +ThinkPad laptops include a proprietary interface called SMAPI BIOS +(System Management Application Program Interface) which provides some +hardware control functionality that is not accessible by other means. + +This driver exposes some features of the SMAPI BIOS through a sysfs +interface. It is suitable for newer models, on which SMAPI is invoked +through IO port writes. Older models use a different SMAPI interface; +for those, try the "thinkpad" module from the "tpctl" package. + +WARNING: +This driver uses undocumented features and direct hardware access. +It thus cannot be guaranteed to work, and may cause arbitrary damage +(especially on models it wasn't tested on). + + +Module parameters +----------------- + +thinkpad_ec module: + force_io=1 lets thinkpad_ec load on some recent ThinkPad models + (e.g., T400 and T500) whose BIOS's ACPI DSDT reserves the ports we need. +tp_smapi module: + debug=1 enables verbose dmesg output. + + +Usage +----- + +Control of battery charging thresholds (in percents of current full charge +capacity): + +# echo 40 > /sys/devices/platform/smapi/BAT0/start_charge_thresh +# echo 70 > /sys/devices/platform/smapi/BAT0/stop_charge_thresh +# cat /sys/devices/platform/smapi/BAT0/*_charge_thresh + + (This is useful since Li-Ion batteries wear out much faster at very + high or low charge levels. The driver will also keeps the thresholds + across suspend-to-disk with AC disconnected; this isn't done + automatically by the hardware.) + +Inhibiting battery charging for 17 minutes (overrides thresholds): + +# echo 17 > /sys/devices/platform/smapi/BAT0/inhibit_charge_minutes +# echo 0 > /sys/devices/platform/smapi/BAT0/inhibit_charge_minutes # stop +# cat /sys/devices/platform/smapi/BAT0/inhibit_charge_minutes + + (This can be used to control which battery is charged when using an + Ultrabay battery.) + +Forcing battery discharging even if AC power available: + +# echo 1 > /sys/devices/platform/smapi/BAT0/force_discharge # start discharge +# echo 0 > /sys/devices/platform/smapi/BAT0/force_discharge # stop discharge +# cat /sys/devices/platform/smapi/BAT0/force_discharge + + (When AC is connected, forced discharging will automatically stop + when battery is fully depleted -- this is useful for calibration. + Also, this attribute can be used to control which battery is discharged + when both a system battery and an Ultrabay battery are connected.) + +Misc read-only battery status attributes (see note about HDAPS below): + +/sys/devices/platform/smapi/BAT0/installed # 0 or 1 +/sys/devices/platform/smapi/BAT0/state # idle/charging/discharging +/sys/devices/platform/smapi/BAT0/cycle_count # integer counter +/sys/devices/platform/smapi/BAT0/current_now # instantaneous current +/sys/devices/platform/smapi/BAT0/current_avg # last minute average +/sys/devices/platform/smapi/BAT0/power_now # instantaneous power +/sys/devices/platform/smapi/BAT0/power_avg # last minute average +/sys/devices/platform/smapi/BAT0/last_full_capacity # in mWh +/sys/devices/platform/smapi/BAT0/remaining_percent # remaining percent of energy (set by calibration) +/sys/devices/platform/smapi/BAT0/remaining_percent_error # error range of remaing_percent (not reset by calibration) +/sys/devices/platform/smapi/BAT0/remaining_running_time # in minutes, by last minute average power +/sys/devices/platform/smapi/BAT0/remaining_running_time_now # in minutes, by instantenous power +/sys/devices/platform/smapi/BAT0/remaining_charging_time # in minutes +/sys/devices/platform/smapi/BAT0/remaining_capacity # in mWh +/sys/devices/platform/smapi/BAT0/design_capacity # in mWh +/sys/devices/platform/smapi/BAT0/voltage # in mV +/sys/devices/platform/smapi/BAT0/design_voltage # in mV +/sys/devices/platform/smapi/BAT0/charging_max_current # max charging current +/sys/devices/platform/smapi/BAT0/charging_max_voltage # max charging voltage +/sys/devices/platform/smapi/BAT0/group{0,1,2,3}_voltage # see below +/sys/devices/platform/smapi/BAT0/manufacturer # string +/sys/devices/platform/smapi/BAT0/model # string +/sys/devices/platform/smapi/BAT0/barcoding # string +/sys/devices/platform/smapi/BAT0/chemistry # string +/sys/devices/platform/smapi/BAT0/serial # integer +/sys/devices/platform/smapi/BAT0/manufacture_date # YYYY-MM-DD +/sys/devices/platform/smapi/BAT0/first_use_date # YYYY-MM-DD +/sys/devices/platform/smapi/BAT0/temperature # in milli-Celsius +/sys/devices/platform/smapi/BAT0/dump # see below +/sys/devices/platform/smapi/ac_connected # 0 or 1 + +The BAT0/group{0,1,2,3}_voltage attribute refers to the separate cell groups +in each battery. For example, on the ThinkPad 600, X3x, T4x and R5x models, +the battery contains 3 cell groups in series, where each group consisting of 2 +or 3 cells connected in parallel. The voltage of each group is given by these +attributes, and their sum (roughly) equals the "voltage" attribute. +(The effective performance of the battery is determined by the weakest group, +i.e., the one those voltage changes most rapidly during dis/charging.) + +The "BAT0/dump" attribute gives a a hex dump of the raw status data, which +contains additional data now in the above (if you can figure it out). Some +unused values are autodetected and replaced by "--": + +In all of the above, replace BAT0 with BAT1 to address the 2nd battery (e.g. +in the UltraBay). + + +Raw SMAPI calls: + +/sys/devices/platform/smapi/smapi_request +This performs raw SMAPI calls. It uses a bad interface that cannot handle +multiple simultaneous access. Don't touch it, it's for development only. +If you did touch it, you would so something like +# echo '211a 100 0 0' > /sys/devices/platform/smapi/smapi_request +# cat /sys/devices/platform/smapi/smapi_request +and notice that in the output "211a 34b b2 0 0 0 'OK'", the "4b" in the 2nd +value, converted to decimal is 75: the current charge stop threshold. + + +Model-specific status +--------------------- + +Works (at least partially) on the following ThinkPad model: +* A30 +* G41 +* R40, R50p, R51, R52 +* T23, T40, T40p, T41, T41p, T42, T42p, T43, T43p, T60 +* X24, X31, X32, X40, X41, X60 +* Z60t, Z61m + +Not all functions are available on all models; for detailed status, see: + http://thinkwiki.org/wiki/tp_smapi + +Please report success/failure by e-mail or on the Wiki. +If you get a "not implemented" or "not supported" message, your laptop +probably just can't do that (at least not via the SMAPI BIOS). +For negative reports, follow the bug reporting guidelines below. +If you send me the necessary technical data (i.e., SMAPI function +interfaces), I will support additional models. + + +Additional HDAPS features +------------------------- + +The modified hdaps driver has several improvements on the one in mainline +(beyond resolving the conflict with thinkpad_ec and tp_smapi): + +- Fixes reliability and improves support for recent ThinkPad models + (especially *60 and newer). Unlike the mainline driver, the modified hdaps + correctly follows the Embedded Controller communication protocol. + +- Extends the "invert" parameter to cover all possible axis orientations. + The possible values are as follows. + Let X,Y denote the hardware readouts. + Let R denote the laptop's roll (tilt left/right). + Let P denote the laptop's pitch (tilt forward/backward). + invert=0: R= X P= Y (same as mainline) + invert=1: R=-X P=-Y (same as mainline) + invert=2: R=-X P= Y (new) + invert=3: R= X P=-Y (new) + invert=4: R= Y P= X (new) + invert=5: R=-Y P=-X (new) + invert=6: R=-Y P= X (new) + invert=7: R= Y P=-X (new) + It's probably easiest to just try all 8 possibilities and see which yields + correct results (e.g., in the hdaps-gl visualisation). + +- Adds a whitelist which automatically sets the correct axis orientation for + some models. If the value for your model is wrong or missing, you can override + it using the "invert" parameter. Please also update the tables at + http://www.thinkwiki.org/wiki/tp_smapi and + http://www.thinkwiki.org/wiki/List_of_DMI_IDs + and submit a patch for the whitelist in hdaps.c. + +- Provides new attributes: + /sys/devices/platform/hdaps/sampling_rate: + This determines the frequency at which the host queries the embedded + controller for accelerometer data (and informs the hdaps input devices). + Default=50. + /sys/devices/platform/hdaps/oversampling_ratio: + When set to X, the embedded controller is told to do physical accelerometer + measurements at a rate that is X times higher than the rate at which + the driver reads those measurements (i.e., X*sampling_rate). This + makes the readouts from the embedded controller more fresh, and is also + useful for the running average filter (see next). Default=5 + /sys/devices/platform/hdaps/running_avg_filter_order: + When set to X, reported readouts will be the average of the last X physical + accelerometer measurements. Current firmware allows 1<=X<=8. Setting to a + high value decreases readout fluctuations. The averaging is handled by the + embedded controller, so no CPU resources are used. Higher values make the + readouts smoother, since it averages out both sensor noise (good) and abrupt + changes (bad). Default=2. + +- Provides a second input device, which publishes the raw accelerometer + measurements (without the fuzzing needed for joystick emulation). This input + device can be matched by a udev rule such as the following (all on one line): + KERNEL=="event[0-9]*", ATTRS{phys}=="hdaps/input1", + ATTRS{modalias}=="input:b0019v1014p5054e4801-*", + SYMLINK+="input/hdaps/accelerometer-event + +A new version of the hdapsd userspace daemon, which uses the input device +interface instead of polling sysfs, is available seprately. Using this reduces +the total interrupts per second generated by hdaps+hdapsd (on tickless kernels) +to 50, down from a value that fluctuates between 50 and 100. Set the +sampling_rate sysfs attribute to a lower value to further reduce interrupts, +at the expense of response latency. + +Licensing note: all my changes to the HDAPS driver are licensed under the +GPL version 2 or, at your option and to the extent allowed by derivation from +prior works, any later version. My version of hdaps is derived work from the +mainline version, which at the time of writing is available only under +GPL version 2. + +Bug reporting +------------- + +Mail . Please include: +* Details about your model, +* Relevant "dmesg" output. Make sure thinkpad_ec and tp_smapi are loaded with + the "debug=1" parameter (e.g., use "make load HDAPS=1 DEBUG=1"). +* Output of "dmidecode | grep -C5 Product" +* Does the failed functionality works under Windows? + + +More about SMAPI +---------------- + +For hints about what may be possible via the SMAPI BIOS and how, see: + +* IBM Technical Reference Manual for the ThinkPad 770 + (http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=PFAN-3TUQQD) +* Exported symbols in PWRMGRIF.DLL or TPPWRW32.DLL (e.g., use "objdump -x"). +* drivers/char/mwave/smapi.c in the Linux kernel tree.* +* The "thinkpad" SMAPI module (http://tpctl.sourceforge.net). +* The SMAPI_* constants in tp_smapi.c. + +Note that in the above Technical Reference and in the "thinkpad" module, +SMAPI is invoked through a function call to some physical address. However, +the interface used by tp_smapi and the above mwave drive, and apparently +required by newer ThinkPad, is different: you set the parameters up in the +CPU's registers and write to ports 0xB2 (the APM control port) and 0x4F; this +triggers an SMI (System Management Interrupt), causing the CPU to enter +SMM (System Management Mode) and run the BIOS firmware; the results are +returned in the CPU's registers. It is not clear what is the relation between +the two variants of SMAPI, though the assignment of error codes seems to be +similar. + +In addition, the embedded controller on ThinkPad laptops has a non-standard +interface at IO ports 0x1600-0x161F (mapped to LCP channel 3 of the H8S chip). +The interface provides various system management services (currently known: +battery information and accelerometer readouts). For more information see the +thinkpad_ec module and the H8S hardware documentation: +http://documentation.renesas.com/eng/products/mpumcu/rej09b0300_2140bhm.pdf diff --git a/MAINTAINERS b/MAINTAINERS index 6ee06ea..ef7947a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -11181,6 +11181,13 @@ S: Maintained F: drivers/tc/ F: include/linux/tc.h +TUXONICE (ENHANCED HIBERNATION) +P: Nigel Cunningham +M: nigel@nigelcunningham.com.au +L: tuxonice-devel@tuxonice.net +W: http://tuxonice.net +S: Maintained + U14-34F SCSI DRIVER M: Dario Ballabio L: linux-scsi@vger.kernel.org diff --git a/Makefile b/Makefile index a7d9810..06fe3c3 100644 --- a/Makefile +++ b/Makefile @@ -615,6 +615,8 @@ KBUILD_CFLAGS += $(call cc-option,-fno-delete-null-pointer-checks,) ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE KBUILD_CFLAGS += -Os $(call cc-disable-warning,maybe-uninitialized,) +else ifdef CONFIG_CC_OPTIMIZE_HARDER +KBUILD_CFLAGS += -O3 $(call cc-disable-warning,maybe-uninitialized,) else KBUILD_CFLAGS += -O2 endif diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index dda1959..08e49c4 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -506,18 +506,18 @@ static void kvm_arm_resume_guest(struct kvm *kvm) struct kvm_vcpu *vcpu; kvm_for_each_vcpu(i, vcpu, kvm) { - wait_queue_head_t *wq = kvm_arch_vcpu_wq(vcpu); + struct swait_queue_head *wq = kvm_arch_vcpu_wq(vcpu); vcpu->arch.pause = false; - wake_up_interruptible(wq); + swake_up(wq); } } static void vcpu_sleep(struct kvm_vcpu *vcpu) { - wait_queue_head_t *wq = kvm_arch_vcpu_wq(vcpu); + struct swait_queue_head *wq = kvm_arch_vcpu_wq(vcpu); - wait_event_interruptible(*wq, ((!vcpu->arch.power_off) && + swait_event_interruptible(*wq, ((!vcpu->arch.power_off) && (!vcpu->arch.pause))); } diff --git a/arch/arm/kvm/psci.c b/arch/arm/kvm/psci.c index a9b3b90..c2b1315 100644 --- a/arch/arm/kvm/psci.c +++ b/arch/arm/kvm/psci.c @@ -70,7 +70,7 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu) { struct kvm *kvm = source_vcpu->kvm; struct kvm_vcpu *vcpu = NULL; - wait_queue_head_t *wq; + struct swait_queue_head *wq; unsigned long cpu_id; unsigned long context_id; phys_addr_t target_pc; @@ -119,7 +119,7 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu) smp_mb(); /* Make sure the above is visible */ wq = kvm_arch_vcpu_wq(vcpu); - wake_up_interruptible(wq); + swake_up(wq); return PSCI_RET_SUCCESS; } diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index 3110447..70ef1a4 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -445,8 +445,8 @@ int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, dvcpu->arch.wait = 0; - if (waitqueue_active(&dvcpu->wq)) - wake_up_interruptible(&dvcpu->wq); + if (swait_active(&dvcpu->wq)) + swake_up(&dvcpu->wq); return 0; } @@ -1174,8 +1174,8 @@ static void kvm_mips_comparecount_func(unsigned long data) kvm_mips_callbacks->queue_timer_int(vcpu); vcpu->arch.wait = 0; - if (waitqueue_active(&vcpu->wq)) - wake_up_interruptible(&vcpu->wq); + if (swait_active(&vcpu->wq)) + swake_up(&vcpu->wq); } /* low level hrtimer wake routine */ diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 9d08d8c..c98afa5 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -289,7 +289,7 @@ struct kvmppc_vcore { struct list_head runnable_threads; struct list_head preempt_list; spinlock_t lock; - wait_queue_head_t wq; + struct swait_queue_head wq; spinlock_t stoltb_lock; /* protects stolen_tb and preempt_tb */ u64 stolen_tb; u64 preempt_tb; @@ -629,7 +629,7 @@ struct kvm_vcpu_arch { u8 prodded; u32 last_inst; - wait_queue_head_t *wqp; + struct swait_queue_head *wqp; struct kvmppc_vcore *vcore; int ret; int trap; diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index baeddb0..f1187bb 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -114,11 +114,11 @@ static bool kvmppc_ipi_thread(int cpu) static void kvmppc_fast_vcpu_kick_hv(struct kvm_vcpu *vcpu) { int cpu; - wait_queue_head_t *wqp; + struct swait_queue_head *wqp; wqp = kvm_arch_vcpu_wq(vcpu); - if (waitqueue_active(wqp)) { - wake_up_interruptible(wqp); + if (swait_active(wqp)) { + swake_up(wqp); ++vcpu->stat.halt_wakeup; } @@ -701,8 +701,8 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) tvcpu->arch.prodded = 1; smp_mb(); if (vcpu->arch.ceded) { - if (waitqueue_active(&vcpu->wq)) { - wake_up_interruptible(&vcpu->wq); + if (swait_active(&vcpu->wq)) { + swake_up(&vcpu->wq); vcpu->stat.halt_wakeup++; } } @@ -1459,7 +1459,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core) INIT_LIST_HEAD(&vcore->runnable_threads); spin_lock_init(&vcore->lock); spin_lock_init(&vcore->stoltb_lock); - init_waitqueue_head(&vcore->wq); + init_swait_queue_head(&vcore->wq); vcore->preempt_tb = TB_NIL; vcore->lpcr = kvm->arch.lpcr; vcore->first_vcpuid = core * threads_per_subcore; @@ -2531,10 +2531,9 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc) { struct kvm_vcpu *vcpu; int do_sleep = 1; + DECLARE_SWAITQUEUE(wait); - DEFINE_WAIT(wait); - - prepare_to_wait(&vc->wq, &wait, TASK_INTERRUPTIBLE); + prepare_to_swait(&vc->wq, &wait, TASK_INTERRUPTIBLE); /* * Check one last time for pending exceptions and ceded state after @@ -2548,7 +2547,7 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc) } if (!do_sleep) { - finish_wait(&vc->wq, &wait); + finish_swait(&vc->wq, &wait); return; } @@ -2556,7 +2555,7 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc) trace_kvmppc_vcore_blocked(vc, 0); spin_unlock(&vc->lock); schedule(); - finish_wait(&vc->wq, &wait); + finish_swait(&vc->wq, &wait); spin_lock(&vc->lock); vc->vcore_state = VCORE_INACTIVE; trace_kvmppc_vcore_blocked(vc, 1); @@ -2612,7 +2611,7 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu) kvmppc_start_thread(vcpu, vc); trace_kvm_guest_enter(vcpu); } else if (vc->vcore_state == VCORE_SLEEPING) { - wake_up(&vc->wq); + swake_up(&vc->wq); } } diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h index 8959ebb..b0c8ad0 100644 --- a/arch/s390/include/asm/kvm_host.h +++ b/arch/s390/include/asm/kvm_host.h @@ -467,7 +467,7 @@ struct kvm_s390_irq_payload { struct kvm_s390_local_interrupt { spinlock_t lock; struct kvm_s390_float_interrupt *float_int; - wait_queue_head_t *wq; + struct swait_queue_head *wq; atomic_t *cpuflags; DECLARE_BITMAP(sigp_emerg_pending, KVM_MAX_VCPUS); struct kvm_s390_irq_payload irq; diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c index f88ca72..9ffc732 100644 --- a/arch/s390/kvm/interrupt.c +++ b/arch/s390/kvm/interrupt.c @@ -966,13 +966,13 @@ no_timer: void kvm_s390_vcpu_wakeup(struct kvm_vcpu *vcpu) { - if (waitqueue_active(&vcpu->wq)) { + if (swait_active(&vcpu->wq)) { /* * The vcpu gave up the cpu voluntarily, mark it as a good * yield-candidate. */ vcpu->preempted = true; - wake_up_interruptible(&vcpu->wq); + swake_up(&vcpu->wq); vcpu->stat.halt_wakeup++; } } diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu index 3ba5ff2..ee3e7d4 100644 --- a/arch/x86/Kconfig.cpu +++ b/arch/x86/Kconfig.cpu @@ -147,9 +147,8 @@ config MPENTIUM4 -Paxville -Dempsey - config MK6 - bool "K6/K6-II/K6-III" + bool "AMD K6/K6-II/K6-III" depends on X86_32 ---help--- Select this for an AMD K6-family processor. Enables use of @@ -157,7 +156,7 @@ config MK6 flags to GCC. config MK7 - bool "Athlon/Duron/K7" + bool "AMD Athlon/Duron/K7" depends on X86_32 ---help--- Select this for an AMD Athlon K7-family processor. Enables use of @@ -165,12 +164,69 @@ config MK7 flags to GCC. config MK8 - bool "Opteron/Athlon64/Hammer/K8" + bool "AMD Opteron/Athlon64/Hammer/K8" ---help--- Select this for an AMD Opteron or Athlon64 Hammer-family processor. Enables use of some extended instructions, and passes appropriate optimization flags to GCC. +config MK8SSE3 + bool "AMD Opteron/Athlon64/Hammer/K8 with SSE3" + ---help--- + Select this for improved AMD Opteron or Athlon64 Hammer-family processors. + Enables use of some extended instructions, and passes appropriate + optimization flags to GCC. + +config MK10 + bool "AMD 61xx/7x50/PhenomX3/X4/II/K10" + ---help--- + Select this for an AMD 61xx Eight-Core Magny-Cours, Athlon X2 7x50, + Phenom X3/X4/II, Athlon II X2/X3/X4, or Turion II-family processor. + Enables use of some extended instructions, and passes appropriate + optimization flags to GCC. + +config MBARCELONA + bool "AMD Barcelona" + ---help--- + Select this for AMD Barcelona and newer processors. + + Enables -march=barcelona + +config MBOBCAT + bool "AMD Bobcat" + ---help--- + Select this for AMD Bobcat processors. + + Enables -march=btver1 + +config MBULLDOZER + bool "AMD Bulldozer" + ---help--- + Select this for AMD Bulldozer processors. + + Enables -march=bdver1 + +config MPILEDRIVER + bool "AMD Piledriver" + ---help--- + Select this for AMD Piledriver processors. + + Enables -march=bdver2 + +config MSTEAMROLLER + bool "AMD Steamroller" + ---help--- + Select this for AMD Steamroller processors. + + Enables -march=bdver3 + +config MJAGUAR + bool "AMD Jaguar" + ---help--- + Select this for AMD Jaguar processors. + + Enables -march=btver2 + config MCRUSOE bool "Crusoe" depends on X86_32 @@ -261,8 +317,17 @@ config MPSC using the cpu family field in /proc/cpuinfo. Family 15 is an older Xeon, Family 6 a newer one. +config MATOM + bool "Intel Atom" + ---help--- + + Select this for the Intel Atom platform. Intel Atom CPUs have an + in-order pipelining architecture and thus can benefit from + accordingly optimized code. Use a recent GCC with specific Atom + support in order to fully benefit from selecting this option. + config MCORE2 - bool "Core 2/newer Xeon" + bool "Intel Core 2" ---help--- Select this for Intel Core 2 and newer Core 2 Xeons (Xeon 51xx and @@ -270,14 +335,71 @@ config MCORE2 family in /proc/cpuinfo. Newer ones have 6 and older ones 15 (not a typo) -config MATOM - bool "Intel Atom" + Enables -march=core2 + +config MNEHALEM + bool "Intel Nehalem" ---help--- - Select this for the Intel Atom platform. Intel Atom CPUs have an - in-order pipelining architecture and thus can benefit from - accordingly optimized code. Use a recent GCC with specific Atom - support in order to fully benefit from selecting this option. + Select this for 1st Gen Core processors in the Nehalem family. + + Enables -march=nehalem + +config MWESTMERE + bool "Intel Westmere" + ---help--- + + Select this for the Intel Westmere formerly Nehalem-C family. + + Enables -march=westmere + +config MSILVERMONT + bool "Intel Silvermont" + ---help--- + + Select this for the Intel Silvermont platform. + + Enables -march=silvermont + +config MSANDYBRIDGE + bool "Intel Sandy Bridge" + ---help--- + + Select this for 2nd Gen Core processors in the Sandy Bridge family. + + Enables -march=sandybridge + +config MIVYBRIDGE + bool "Intel Ivy Bridge" + ---help--- + + Select this for 3rd Gen Core processors in the Ivy Bridge family. + + Enables -march=ivybridge + +config MHASWELL + bool "Intel Haswell" + ---help--- + + Select this for 4th Gen Core processors in the Haswell family. + + Enables -march=haswell + +config MBROADWELL + bool "Intel Broadwell" + ---help--- + + Select this for 5th Gen Core processors in the Broadwell family. + + Enables -march=broadwell + +config MSKYLAKE + bool "Intel Skylake" + ---help--- + + Select this for 6th Gen Core processors in the Skylake family. + + Enables -march=skylake config GENERIC_CPU bool "Generic-x86-64" @@ -286,6 +408,19 @@ config GENERIC_CPU Generic x86-64 CPU. Run equally well on all x86-64 CPUs. +config MNATIVE + bool "Native optimizations autodetected by GCC" + ---help--- + + GCC 4.2 and above support -march=native, which automatically detects + the optimum settings to use based on your processor. -march=native + also detects and applies additional settings beyond -march specific + to your CPU, (eg. -msse4). Unless you have a specific reason not to + (e.g. distcc cross-compiling), you should probably be using + -march=native rather than anything listed below. + + Enables -march=native + endchoice config X86_GENERIC @@ -300,6 +435,16 @@ config X86_GENERIC This is really intended for distributors who need more generic optimizations. +config X86_MARCH_NATIVE + bool "Use -march=native cflag (EXPERIMENTAL)" + help + Setting Y here, will result in passing the -march=native and + -mtune=native cflags to GCC while compiling the kernel, which + makes GCC check the CPU capabilities and use the best cflags + for your computer. + + Set Y here only if you use >=gcc-4.2.0. + # # Define implied options from the CPU selection here config X86_INTERNODE_CACHE_SHIFT @@ -310,7 +455,7 @@ config X86_INTERNODE_CACHE_SHIFT config X86_L1_CACHE_SHIFT int default "7" if MPENTIUM4 || MPSC - default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU + default "6" if MK7 || MK8 || MK8SSE3 || MK10 || MBARCELONA || MBOBCAT || MBULLDOZER || MPILEDRIVER || MSTEAMROLLER || MJAGUAR || MPENTIUMM || MCORE2 || MNEHALEM || MWESTMERE || MSILVERMONT || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MNATIVE || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU default "4" if MELAN || M486 || MGEODEGX1 default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX @@ -341,11 +486,11 @@ config X86_ALIGNMENT_16 config X86_INTEL_USERCOPY def_bool y - depends on MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M586MMX || X86_GENERIC || MK8 || MK7 || MEFFICEON || MCORE2 + depends on MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M586MMX || X86_GENERIC || MK8 || MK8SSE3 || MK7 || MEFFICEON || MCORE2 || MK10 || MBARCELONA || MNEHALEM || MWESTMERE || MSILVERMONT || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MNATIVE config X86_USE_PPRO_CHECKSUM def_bool y - depends on MWINCHIP3D || MWINCHIPC6 || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MK8 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MCORE2 || MATOM + depends on MWINCHIP3D || MWINCHIPC6 || MCYRIXIII || MK7 || MK6 || MK10 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MK8 || MK8SSE3 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MCORE2 || MNEHALEM || MWESTMERE || MSILVERMONT || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MATOM || MNATIVE config X86_USE_3DNOW def_bool y @@ -369,17 +514,17 @@ config X86_P6_NOP config X86_TSC def_bool y - depends on (MWINCHIP3D || MCRUSOE || MEFFICEON || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MK8 || MVIAC3_2 || MVIAC7 || MGEODEGX1 || MGEODE_LX || MCORE2 || MATOM) || X86_64 + depends on (MWINCHIP3D || MCRUSOE || MEFFICEON || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MK8 || MK8SSE3 || MVIAC3_2 || MVIAC7 || MGEODEGX1 || MGEODE_LX || MCORE2 || MNEHALEM || MWESTMERE || MSILVERMONT || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MNATIVE || MATOM) || X86_64 config X86_CMPXCHG64 def_bool y - depends on X86_PAE || X86_64 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MATOM + depends on X86_PAE || X86_64 || MCORE2 || MNEHALEM || MWESTMERE || MSILVERMONT || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MATOM || MNATIVE # this should be set for all -march=.. options where the compiler # generates cmov. config X86_CMOV def_bool y - depends on (MK8 || MK7 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || X86_64 || MATOM || MGEODE_LX) + depends on (MK8 || MK8SSE3 || MK10 || MBARCELONA || MBOBCAT || MBULLDOZER || MPILEDRIVER || MSTEAMROLLER || MJAGUAR || MK7 || MCORE2 || MNEHALEM || MWESTMERE || MSILVERMONT || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || X86_64 || MNATIVE || MATOM || MGEODE_LX) config X86_MINIMUM_CPU_FAMILY int diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 4086abc..f728a7c 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -104,13 +104,38 @@ else KBUILD_CFLAGS += $(call cc-option,-mskip-rax-setup) # FIXME - should be integrated in Makefile.cpu (Makefile_32.cpu) + cflags-$(CONFIG_MNATIVE) += $(call cc-option,-march=native) cflags-$(CONFIG_MK8) += $(call cc-option,-march=k8) + cflags-$(CONFIG_MK8SSE3) += $(call cc-option,-march=k8-sse3,-mtune=k8) + cflags-$(CONFIG_MK10) += $(call cc-option,-march=amdfam10) + cflags-$(CONFIG_MBARCELONA) += $(call cc-option,-march=barcelona) + cflags-$(CONFIG_MBOBCAT) += $(call cc-option,-march=btver1) + cflags-$(CONFIG_MBULLDOZER) += $(call cc-option,-march=bdver1) + cflags-$(CONFIG_MPILEDRIVER) += $(call cc-option,-march=bdver2) + cflags-$(CONFIG_MSTEAMROLLER) += $(call cc-option,-march=bdver3) + cflags-$(CONFIG_MJAGUAR) += $(call cc-option,-march=btver2) cflags-$(CONFIG_MPSC) += $(call cc-option,-march=nocona) cflags-$(CONFIG_MCORE2) += \ - $(call cc-option,-march=core2,$(call cc-option,-mtune=generic)) - cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom) \ - $(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic)) + $(call cc-option,-march=core2,$(call cc-option,-mtune=core2)) + cflags-$(CONFIG_MNEHALEM) += \ + $(call cc-option,-march=nehalem,$(call cc-option,-mtune=nehalem)) + cflags-$(CONFIG_MWESTMERE) += \ + $(call cc-option,-march=westmere,$(call cc-option,-mtune=westmere)) + cflags-$(CONFIG_MSILVERMONT) += \ + $(call cc-option,-march=silvermont,$(call cc-option,-mtune=silvermont)) + cflags-$(CONFIG_MSANDYBRIDGE) += \ + $(call cc-option,-march=sandybridge,$(call cc-option,-mtune=sandybridge)) + cflags-$(CONFIG_MIVYBRIDGE) += \ + $(call cc-option,-march=ivybridge,$(call cc-option,-mtune=ivybridge)) + cflags-$(CONFIG_MHASWELL) += \ + $(call cc-option,-march=haswell,$(call cc-option,-mtune=haswell)) + cflags-$(CONFIG_MBROADWELL) += \ + $(call cc-option,-march=broadwell,$(call cc-option,-mtune=broadwell)) + cflags-$(CONFIG_MSKYLAKE) += \ + $(call cc-option,-march=skylake,$(call cc-option,-mtune=skylake)) + cflags-$(CONFIG_MATOM) += $(call cc-option,-march=bonnell) \ + $(call cc-option,-mtune=bonnell,$(call cc-option,-mtune=generic)) cflags-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=generic) KBUILD_CFLAGS += $(cflags-y) diff --git a/arch/x86/Makefile_32.cpu b/arch/x86/Makefile_32.cpu index 6647ed4..6b71290 100644 --- a/arch/x86/Makefile_32.cpu +++ b/arch/x86/Makefile_32.cpu @@ -10,6 +10,9 @@ tune = $(call cc-option,-mcpu=$(1),$(2)) endif align := $(cc-option-align) +ifeq ($(CONFIG_X86_MARCH_NATIVE),y) +cflags-y += -march=native +else cflags-$(CONFIG_M486) += -march=i486 cflags-$(CONFIG_M586) += -march=i586 cflags-$(CONFIG_M586TSC) += -march=i586 @@ -23,7 +26,16 @@ cflags-$(CONFIG_MK6) += -march=k6 # Please note, that patches that add -march=athlon-xp and friends are pointless. # They make zero difference whatsosever to performance at this time. cflags-$(CONFIG_MK7) += -march=athlon +cflags-$(CONFIG_MNATIVE) += $(call cc-option,-march=native) cflags-$(CONFIG_MK8) += $(call cc-option,-march=k8,-march=athlon) +cflags-$(CONFIG_MK8SSE3) += $(call cc-option,-march=k8-sse3,-march=athlon) +cflags-$(CONFIG_MK10) += $(call cc-option,-march=amdfam10,-march=athlon) +cflags-$(CONFIG_MBARCELONA) += $(call cc-option,-march=barcelona,-march=athlon) +cflags-$(CONFIG_MBOBCAT) += $(call cc-option,-march=btver1,-march=athlon) +cflags-$(CONFIG_MBULLDOZER) += $(call cc-option,-march=bdver1,-march=athlon) +cflags-$(CONFIG_MPILEDRIVER) += $(call cc-option,-march=bdver2,-march=athlon) +cflags-$(CONFIG_MSTEAMROLLER) += $(call cc-option,-march=bdver3,-march=athlon) +cflags-$(CONFIG_MJAGUAR) += $(call cc-option,-march=btver2,-march=athlon) cflags-$(CONFIG_MCRUSOE) += -march=i686 $(align)-functions=0 $(align)-jumps=0 $(align)-loops=0 cflags-$(CONFIG_MEFFICEON) += -march=i686 $(call tune,pentium3) $(align)-functions=0 $(align)-jumps=0 $(align)-loops=0 cflags-$(CONFIG_MWINCHIPC6) += $(call cc-option,-march=winchip-c6,-march=i586) @@ -32,8 +44,16 @@ cflags-$(CONFIG_MCYRIXIII) += $(call cc-option,-march=c3,-march=i486) $(align)-f cflags-$(CONFIG_MVIAC3_2) += $(call cc-option,-march=c3-2,-march=i686) cflags-$(CONFIG_MVIAC7) += -march=i686 cflags-$(CONFIG_MCORE2) += -march=i686 $(call tune,core2) -cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom,$(call cc-option,-march=core2,-march=i686)) \ - $(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic)) +cflags-$(CONFIG_MNEHALEM) += -march=i686 $(call tune,nehalem) +cflags-$(CONFIG_MWESTMERE) += -march=i686 $(call tune,westmere) +cflags-$(CONFIG_MSILVERMONT) += -march=i686 $(call tune,silvermont) +cflags-$(CONFIG_MSANDYBRIDGE) += -march=i686 $(call tune,sandybridge) +cflags-$(CONFIG_MIVYBRIDGE) += -march=i686 $(call tune,ivybridge) +cflags-$(CONFIG_MHASWELL) += -march=i686 $(call tune,haswell) +cflags-$(CONFIG_MBROADWELL) += -march=i686 $(call tune,broadwell) +cflags-$(CONFIG_MSKYLAKE) += -march=i686 $(call tune,skylake) +cflags-$(CONFIG_MATOM) += $(call cc-option,-march=bonnell,$(call cc-option,-march=core2,-march=i686)) \ + $(call cc-option,-mtune=bonnell,$(call cc-option,-mtune=generic)) # AMD Elan support cflags-$(CONFIG_MELAN) += -march=i486 @@ -41,6 +61,8 @@ cflags-$(CONFIG_MELAN) += -march=i486 # Geode GX1 support cflags-$(CONFIG_MGEODEGX1) += -march=pentium-mmx cflags-$(CONFIG_MGEODE_LX) += $(call cc-option,-march=geode,-march=pentium-mmx) +endif + # add at the end to overwrite eventual tuning options from earlier # cpu entries cflags-$(CONFIG_X86_GENERIC) += $(call tune,generic,$(call tune,i686)) diff --git a/arch/x86/include/asm/module.h b/arch/x86/include/asm/module.h index e3b7819..90f3c76 100644 --- a/arch/x86/include/asm/module.h +++ b/arch/x86/include/asm/module.h @@ -15,6 +15,24 @@ #define MODULE_PROC_FAMILY "586MMX " #elif defined CONFIG_MCORE2 #define MODULE_PROC_FAMILY "CORE2 " +#elif defined CONFIG_MNATIVE +#define MODULE_PROC_FAMILY "NATIVE " +#elif defined CONFIG_MNEHALEM +#define MODULE_PROC_FAMILY "NEHALEM " +#elif defined CONFIG_MWESTMERE +#define MODULE_PROC_FAMILY "WESTMERE " +#elif defined CONFIG_MSILVERMONT +#define MODULE_PROC_FAMILY "SILVERMONT " +#elif defined CONFIG_MSANDYBRIDGE +#define MODULE_PROC_FAMILY "SANDYBRIDGE " +#elif defined CONFIG_MIVYBRIDGE +#define MODULE_PROC_FAMILY "IVYBRIDGE " +#elif defined CONFIG_MHASWELL +#define MODULE_PROC_FAMILY "HASWELL " +#elif defined CONFIG_MBROADWELL +#define MODULE_PROC_FAMILY "BROADWELL " +#elif defined CONFIG_MSKYLAKE +#define MODULE_PROC_FAMILY "SKYLAKE " #elif defined CONFIG_MATOM #define MODULE_PROC_FAMILY "ATOM " #elif defined CONFIG_M686 @@ -33,6 +51,22 @@ #define MODULE_PROC_FAMILY "K7 " #elif defined CONFIG_MK8 #define MODULE_PROC_FAMILY "K8 " +#elif defined CONFIG_MK8SSE3 +#define MODULE_PROC_FAMILY "K8SSE3 " +#elif defined CONFIG_MK10 +#define MODULE_PROC_FAMILY "K10 " +#elif defined CONFIG_MBARCELONA +#define MODULE_PROC_FAMILY "BARCELONA " +#elif defined CONFIG_MBOBCAT +#define MODULE_PROC_FAMILY "BOBCAT " +#elif defined CONFIG_MBULLDOZER +#define MODULE_PROC_FAMILY "BULLDOZER " +#elif defined CONFIG_MPILEDRIVER +#define MODULE_PROC_FAMILY "STEAMROLLER " +#elif defined CONFIG_MSTEAMROLLER +#define MODULE_PROC_FAMILY "PILEDRIVER " +#elif defined CONFIG_MJAGUAR +#define MODULE_PROC_FAMILY "JAGUAR " #elif defined CONFIG_MELAN #define MODULE_PROC_FAMILY "ELAN " #elif defined CONFIG_MCRUSOE diff --git a/arch/x86/kernel/early_printk.c b/arch/x86/kernel/early_printk.c index 21bf924..1fbf5af 100644 --- a/arch/x86/kernel/early_printk.c +++ b/arch/x86/kernel/early_printk.c @@ -27,7 +27,8 @@ static int max_ypos = 25, max_xpos = 80; static int current_ypos = 25, current_xpos; -static void early_vga_write(struct console *con, const char *str, unsigned n) +static void early_vga_write(struct console *con, const char *str, unsigned n, + unsigned int loglevel) { char c; int i, k, j; @@ -118,7 +119,8 @@ static int early_serial_putc(unsigned char ch) return timeout ? 0 : -1; } -static void early_serial_write(struct console *con, const char *s, unsigned n) +static void early_serial_write(struct console *con, const char *s, unsigned n, + unsigned int loglevel) { while (*s && n-- > 0) { if (*s == '\n') diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c index 4d38416..a457bce 100644 --- a/arch/x86/kernel/espfix_64.c +++ b/arch/x86/kernel/espfix_64.c @@ -173,6 +173,7 @@ void init_espfix_ap(int cpu) struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0); pmd_p = (pmd_t *)page_address(page); + SetPageTOI_Untracked(virt_to_page(pmd_p)); pud = __pud(__pa(pmd_p) | (PGTABLE_PROT & ptemask)); paravirt_alloc_pmd(&init_mm, __pa(pmd_p) >> PAGE_SHIFT); for (n = 0; n < ESPFIX_PUD_CLONES; n++) @@ -185,6 +186,7 @@ void init_espfix_ap(int cpu) struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0); pte_p = (pte_t *)page_address(page); + SetPageTOI_Untracked(virt_to_page(pte_p)); pmd = __pmd(__pa(pte_p) | (PGTABLE_PROT & ptemask)); paravirt_alloc_pte(&init_mm, __pa(pte_p) >> PAGE_SHIFT); for (n = 0; n < ESPFIX_PMD_CLONES; n++) @@ -193,6 +195,7 @@ void init_espfix_ap(int cpu) pte_p = pte_offset_kernel(&pmd, addr); stack_page = page_address(alloc_pages_node(node, GFP_KERNEL, 0)); + SetPageTOI_Untracked(virt_to_page(stack_page)); pte = __pte(__pa(stack_page) | (__PAGE_KERNEL_RO & ptemask)); for (n = 0; n < ESPFIX_PTE_CLONES; n++) set_pte(&pte_p[n*PTE_STRIDE], pte); diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 3d743da..4b20bb7 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -195,6 +196,10 @@ static void cyc2ns_init(int cpu) c2n->head = c2n->data; c2n->tail = c2n->data; + + // Don't let TuxOnIce make data RO - a secondary CPU will cause a triple fault + // if it loads microcode, which then does a printk, which may end up invoking cycles_2_ns + SetPageTOI_Untracked(virt_to_page(c2n)); } static inline unsigned long long cycles_2_ns(unsigned long long cyc) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 36591fa..3a045f3 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1195,7 +1195,7 @@ static void apic_update_lvtt(struct kvm_lapic *apic) static void apic_timer_expired(struct kvm_lapic *apic) { struct kvm_vcpu *vcpu = apic->vcpu; - wait_queue_head_t *q = &vcpu->wq; + struct swait_queue_head *q = &vcpu->wq; struct kvm_timer *ktimer = &apic->lapic_timer; if (atomic_read(&apic->lapic_timer.pending)) @@ -1204,8 +1204,8 @@ static void apic_timer_expired(struct kvm_lapic *apic) atomic_inc(&apic->lapic_timer.pending); kvm_set_pending_timer(vcpu); - if (waitqueue_active(q)) - wake_up_interruptible(q); + if (swait_active(q)) + swake_up(q); if (apic_lvtt_tscdeadline(apic)) ktimer->expired_tscdeadline = ktimer->tscdeadline; diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index e830c71..8d2472b 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -13,6 +13,7 @@ #include /* hstate_index_to_shift */ #include /* prefetchw */ #include /* exception_enter(), ... */ +#include /* incremental image support */ #include /* faulthandler_disabled() */ #include /* dotraplinkage, ... */ @@ -662,6 +663,10 @@ no_context(struct pt_regs *regs, unsigned long error_code, unsigned long flags; int sig; + if (toi_make_writable(init_mm.pgd, address)) { + return; + } + /* Are we prepared to handle this kernel fault? */ if (fixup_exception(regs)) { /* @@ -916,10 +921,101 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code, } } +#ifdef CONFIG_TOI_INCREMENTAL +/** + * _toi_do_cbw - Do a copy-before-write before letting the faulting process continue + */ +static void toi_do_cbw(struct page *page) +{ + struct toi_cbw_state *state = this_cpu_ptr(&toi_cbw_states); + + state->active = 1; + wmb(); + + if (state->enabled && state->next && PageTOI_CBW(page)) { + struct toi_cbw *this = state->next; + memcpy(this->virt, page_address(page), PAGE_SIZE); + this->pfn = page_to_pfn(page); + state->next = this->next; + } + + state->active = 0; +} + +/** + * _toi_make_writable - Defuse TOI's write protection + */ +int _toi_make_writable(pte_t *pte) +{ + struct page *page = pte_page(*pte); + if (PageTOI_RO(page)) { + pgd_t *pgd = __va(read_cr3()); + /* + * If this is a TuxOnIce caused fault, we may not have permission to + * write to a page needed to reset the permissions of the original + * page. Use swapper_pg_dir to get around this. + */ + load_cr3(swapper_pg_dir); + + set_pte_atomic(pte, pte_mkwrite(*pte)); + SetPageTOI_Dirty(page); + ClearPageTOI_RO(page); + + toi_do_cbw(page); + + load_cr3(pgd); + return 1; + } + return 0; +} + +/** + * toi_make_writable - Handle a (potential) fault caused by TOI's write protection + * + * Make a page writable that was protected. Might be because of a fault, or + * because we're allocating it and want it to be untracked. + * + * Note that in the fault handling case, we don't care about the error code. If + * called from the double fault handler, we won't have one. We just check to + * see if the page was made RO by TOI, and mark it dirty/release the protection + * if it was. + */ +int toi_make_writable(pgd_t *pgd, unsigned long address) +{ + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + + pgd = pgd + pgd_index(address); + if (!pgd_present(*pgd)) + return 0; + + pud = pud_offset(pgd, address); + if (!pud_present(*pud)) + return 0; + + if (pud_large(*pud)) + return _toi_make_writable((pte_t *) pud); + + pmd = pmd_offset(pud, address); + if (!pmd_present(*pmd)) + return 0; + + if (pmd_large(*pmd)) + return _toi_make_writable((pte_t *) pmd); + + pte = pte_offset_kernel(pmd, address); + if (!pte_present(*pte)) + return 0; + + return _toi_make_writable(pte); +} +#endif + static int spurious_fault_check(unsigned long error_code, pte_t *pte) { if ((error_code & PF_WRITE) && !pte_write(*pte)) - return 0; + return 0; if ((error_code & PF_INSTR) && !pte_exec(*pte)) return 0; @@ -1082,6 +1178,15 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code, kmemcheck_hide(regs); prefetchw(&mm->mmap_sem); + /* + * Detect and handle page faults due to TuxOnIce making pages read-only + * so that it can create incremental images. + * + * Do it early to avoid double faults. + */ + if (unlikely(toi_make_writable(init_mm.pgd, address))) + return; + if (unlikely(kmmio_fault(regs, address))) return; diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index 493f541..695ac7d 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -150,9 +150,10 @@ static int page_size_mask; static void __init probe_page_size_mask(void) { -#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK) +#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_TOI_INCREMENTAL) /* - * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages. + * For CONFIG_DEBUG_PAGEALLOC or TuxOnIce's incremental image support, + * identity mapping will use small pages. * This will simplify cpa(), which otherwise needs to support splitting * large pages into small in interrupt context, etc. */ diff --git a/arch/x86/platform/efi/early_printk.c b/arch/x86/platform/efi/early_printk.c index 5241421..5eb772c 100644 --- a/arch/x86/platform/efi/early_printk.c +++ b/arch/x86/platform/efi/early_printk.c @@ -124,7 +124,8 @@ static void early_efi_write_char(u32 *dst, unsigned char c, unsigned int h) } static void -early_efi_write(struct console *con, const char *str, unsigned int num) +early_efi_write(struct console *con, const char *str, unsigned int num, + unsigned int loglevel) { struct screen_info *si; unsigned int len; diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched index 421bef9..1fc1a4d 100644 --- a/block/Kconfig.iosched +++ b/block/Kconfig.iosched @@ -39,9 +39,28 @@ config CFQ_GROUP_IOSCHED ---help--- Enable group IO scheduling in CFQ. +config IOSCHED_BFQ + tristate "BFQ I/O scheduler" + default y + ---help--- + The BFQ I/O scheduler tries to distribute bandwidth among + all processes according to their weights. + It aims at distributing the bandwidth as desired, independently of + the disk parameters and with any workload. It also tries to + guarantee low latency to interactive and soft real-time + applications. If compiled built-in (saying Y here), BFQ can + be configured to support hierarchical scheduling. + +config BFQ_GROUP_IOSCHED + bool "BFQ hierarchical scheduling support" + depends on CGROUPS && IOSCHED_BFQ=y + default n + ---help--- + Enable hierarchical scheduling in BFQ, using the blkio controller. + choice prompt "Default I/O scheduler" - default DEFAULT_CFQ + default DEFAULT_BFQ help Select the I/O scheduler which will be used by default for all block devices. @@ -52,6 +71,16 @@ choice config DEFAULT_CFQ bool "CFQ" if IOSCHED_CFQ=y + config DEFAULT_BFQ + bool "BFQ" if IOSCHED_BFQ=y + help + Selects BFQ as the default I/O scheduler which will be + used by default for all block devices. + The BFQ I/O scheduler aims at distributing the bandwidth + as desired, independently of the disk parameters and with + any workload. It also tries to guarantee low latency to + interactive and soft real-time applications. + config DEFAULT_NOOP bool "No-op" @@ -61,6 +90,7 @@ config DEFAULT_IOSCHED string default "deadline" if DEFAULT_DEADLINE default "cfq" if DEFAULT_CFQ + default "bfq" if DEFAULT_BFQ default "noop" if DEFAULT_NOOP endmenu diff --git a/block/Makefile b/block/Makefile index 9eda232..37afa48 100644 --- a/block/Makefile +++ b/block/Makefile @@ -7,7 +7,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \ blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \ blk-lib.o blk-mq.o blk-mq-tag.o \ blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \ - genhd.o scsi_ioctl.o partition-generic.o ioprio.o \ + uuid.o genhd.o scsi_ioctl.o partition-generic.o ioprio.o \ badblocks.o partitions/ obj-$(CONFIG_BOUNCE) += bounce.o @@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o +obj-$(CONFIG_IOSCHED_BFQ) += bfq-iosched.o obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o diff --git b/block/bfq-cgroup.c b/block/bfq-cgroup.c new file mode 100644 index 0000000..5ee99ec --- /dev/null +++ b/block/bfq-cgroup.c @@ -0,0 +1,1186 @@ +/* + * BFQ: CGROUPS support. + * + * Based on ideas and code from CFQ: + * Copyright (C) 2003 Jens Axboe + * + * Copyright (C) 2008 Fabio Checconi + * Paolo Valente + * + * Copyright (C) 2010 Paolo Valente + * + * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ + * file. + */ + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + +/* bfqg stats flags */ +enum bfqg_stats_flags { + BFQG_stats_waiting = 0, + BFQG_stats_idling, + BFQG_stats_empty, +}; + +#define BFQG_FLAG_FNS(name) \ +static void bfqg_stats_mark_##name(struct bfqg_stats *stats) \ +{ \ + stats->flags |= (1 << BFQG_stats_##name); \ +} \ +static void bfqg_stats_clear_##name(struct bfqg_stats *stats) \ +{ \ + stats->flags &= ~(1 << BFQG_stats_##name); \ +} \ +static int bfqg_stats_##name(struct bfqg_stats *stats) \ +{ \ + return (stats->flags & (1 << BFQG_stats_##name)) != 0; \ +} \ + +BFQG_FLAG_FNS(waiting) +BFQG_FLAG_FNS(idling) +BFQG_FLAG_FNS(empty) +#undef BFQG_FLAG_FNS + +/* This should be called with the queue_lock held. */ +static void bfqg_stats_update_group_wait_time(struct bfqg_stats *stats) +{ + unsigned long long now; + + if (!bfqg_stats_waiting(stats)) + return; + + now = sched_clock(); + if (time_after64(now, stats->start_group_wait_time)) + blkg_stat_add(&stats->group_wait_time, + now - stats->start_group_wait_time); + bfqg_stats_clear_waiting(stats); +} + +/* This should be called with the queue_lock held. */ +static void bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg, + struct bfq_group *curr_bfqg) +{ + struct bfqg_stats *stats = &bfqg->stats; + + if (bfqg_stats_waiting(stats)) + return; + if (bfqg == curr_bfqg) + return; + stats->start_group_wait_time = sched_clock(); + bfqg_stats_mark_waiting(stats); +} + +/* This should be called with the queue_lock held. */ +static void bfqg_stats_end_empty_time(struct bfqg_stats *stats) +{ + unsigned long long now; + + if (!bfqg_stats_empty(stats)) + return; + + now = sched_clock(); + if (time_after64(now, stats->start_empty_time)) + blkg_stat_add(&stats->empty_time, + now - stats->start_empty_time); + bfqg_stats_clear_empty(stats); +} + +static void bfqg_stats_update_dequeue(struct bfq_group *bfqg) +{ + blkg_stat_add(&bfqg->stats.dequeue, 1); +} + +static void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg) +{ + struct bfqg_stats *stats = &bfqg->stats; + + if (blkg_rwstat_total(&stats->queued)) + return; + + /* + * group is already marked empty. This can happen if bfqq got new + * request in parent group and moved to this group while being added + * to service tree. Just ignore the event and move on. + */ + if (bfqg_stats_empty(stats)) + return; + + stats->start_empty_time = sched_clock(); + bfqg_stats_mark_empty(stats); +} + +static void bfqg_stats_update_idle_time(struct bfq_group *bfqg) +{ + struct bfqg_stats *stats = &bfqg->stats; + + if (bfqg_stats_idling(stats)) { + unsigned long long now = sched_clock(); + + if (time_after64(now, stats->start_idle_time)) + blkg_stat_add(&stats->idle_time, + now - stats->start_idle_time); + bfqg_stats_clear_idling(stats); + } +} + +static void bfqg_stats_set_start_idle_time(struct bfq_group *bfqg) +{ + struct bfqg_stats *stats = &bfqg->stats; + + stats->start_idle_time = sched_clock(); + bfqg_stats_mark_idling(stats); +} + +static void bfqg_stats_update_avg_queue_size(struct bfq_group *bfqg) +{ + struct bfqg_stats *stats = &bfqg->stats; + + blkg_stat_add(&stats->avg_queue_size_sum, + blkg_rwstat_total(&stats->queued)); + blkg_stat_add(&stats->avg_queue_size_samples, 1); + bfqg_stats_update_group_wait_time(stats); +} + +static struct blkcg_policy blkcg_policy_bfq; + +/* + * blk-cgroup policy-related handlers + * The following functions help in converting between blk-cgroup + * internal structures and BFQ-specific structures. + */ + +static struct bfq_group *pd_to_bfqg(struct blkg_policy_data *pd) +{ + return pd ? container_of(pd, struct bfq_group, pd) : NULL; +} + +static struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg) +{ + return pd_to_blkg(&bfqg->pd); +} + +static struct bfq_group *blkg_to_bfqg(struct blkcg_gq *blkg) +{ + struct blkg_policy_data *pd = blkg_to_pd(blkg, &blkcg_policy_bfq); + BUG_ON(!pd); + return pd_to_bfqg(pd); +} + +/* + * bfq_group handlers + * The following functions help in navigating the bfq_group hierarchy + * by allowing to find the parent of a bfq_group or the bfq_group + * associated to a bfq_queue. + */ + +static struct bfq_group *bfqg_parent(struct bfq_group *bfqg) +{ + struct blkcg_gq *pblkg = bfqg_to_blkg(bfqg)->parent; + + return pblkg ? blkg_to_bfqg(pblkg) : NULL; +} + +static struct bfq_group *bfqq_group(struct bfq_queue *bfqq) +{ + struct bfq_entity *group_entity = bfqq->entity.parent; + + return group_entity ? container_of(group_entity, struct bfq_group, + entity) : + bfqq->bfqd->root_group; +} + +/* + * The following two functions handle get and put of a bfq_group by + * wrapping the related blk-cgroup hooks. + */ + +static void bfqg_get(struct bfq_group *bfqg) +{ + return blkg_get(bfqg_to_blkg(bfqg)); +} + +static void bfqg_put(struct bfq_group *bfqg) +{ + return blkg_put(bfqg_to_blkg(bfqg)); +} + +static void bfqg_stats_update_io_add(struct bfq_group *bfqg, + struct bfq_queue *bfqq, + int rw) +{ + blkg_rwstat_add(&bfqg->stats.queued, rw, 1); + bfqg_stats_end_empty_time(&bfqg->stats); + if (!(bfqq == ((struct bfq_data *)bfqg->bfqd)->in_service_queue)) + bfqg_stats_set_start_group_wait_time(bfqg, bfqq_group(bfqq)); +} + +static void bfqg_stats_update_io_remove(struct bfq_group *bfqg, int rw) +{ + blkg_rwstat_add(&bfqg->stats.queued, rw, -1); +} + +static void bfqg_stats_update_io_merged(struct bfq_group *bfqg, int rw) +{ + blkg_rwstat_add(&bfqg->stats.merged, rw, 1); +} + +static void bfqg_stats_update_dispatch(struct bfq_group *bfqg, + uint64_t bytes, int rw) +{ + blkg_stat_add(&bfqg->stats.sectors, bytes >> 9); + blkg_rwstat_add(&bfqg->stats.serviced, rw, 1); + blkg_rwstat_add(&bfqg->stats.service_bytes, rw, bytes); +} + +static void bfqg_stats_update_completion(struct bfq_group *bfqg, + uint64_t start_time, uint64_t io_start_time, int rw) +{ + struct bfqg_stats *stats = &bfqg->stats; + unsigned long long now = sched_clock(); + + if (time_after64(now, io_start_time)) + blkg_rwstat_add(&stats->service_time, rw, now - io_start_time); + if (time_after64(io_start_time, start_time)) + blkg_rwstat_add(&stats->wait_time, rw, + io_start_time - start_time); +} + +/* @stats = 0 */ +static void bfqg_stats_reset(struct bfqg_stats *stats) +{ + if (!stats) + return; + + /* queued stats shouldn't be cleared */ + blkg_rwstat_reset(&stats->service_bytes); + blkg_rwstat_reset(&stats->serviced); + blkg_rwstat_reset(&stats->merged); + blkg_rwstat_reset(&stats->service_time); + blkg_rwstat_reset(&stats->wait_time); + blkg_stat_reset(&stats->time); + blkg_stat_reset(&stats->unaccounted_time); + blkg_stat_reset(&stats->avg_queue_size_sum); + blkg_stat_reset(&stats->avg_queue_size_samples); + blkg_stat_reset(&stats->dequeue); + blkg_stat_reset(&stats->group_wait_time); + blkg_stat_reset(&stats->idle_time); + blkg_stat_reset(&stats->empty_time); +} + +/* @to += @from */ +static void bfqg_stats_merge(struct bfqg_stats *to, struct bfqg_stats *from) +{ + if (!to || !from) + return; + + /* queued stats shouldn't be cleared */ + blkg_rwstat_add_aux(&to->service_bytes, &from->service_bytes); + blkg_rwstat_add_aux(&to->serviced, &from->serviced); + blkg_rwstat_add_aux(&to->merged, &from->merged); + blkg_rwstat_add_aux(&to->service_time, &from->service_time); + blkg_rwstat_add_aux(&to->wait_time, &from->wait_time); + blkg_stat_add_aux(&from->time, &from->time); + blkg_stat_add_aux(&to->unaccounted_time, &from->unaccounted_time); + blkg_stat_add_aux(&to->avg_queue_size_sum, &from->avg_queue_size_sum); + blkg_stat_add_aux(&to->avg_queue_size_samples, &from->avg_queue_size_samples); + blkg_stat_add_aux(&to->dequeue, &from->dequeue); + blkg_stat_add_aux(&to->group_wait_time, &from->group_wait_time); + blkg_stat_add_aux(&to->idle_time, &from->idle_time); + blkg_stat_add_aux(&to->empty_time, &from->empty_time); +} + +/* + * Transfer @bfqg's stats to its parent's dead_stats so that the ancestors' + * recursive stats can still account for the amount used by this bfqg after + * it's gone. + */ +static void bfqg_stats_xfer_dead(struct bfq_group *bfqg) +{ + struct bfq_group *parent; + + if (!bfqg) /* root_group */ + return; + + parent = bfqg_parent(bfqg); + + lockdep_assert_held(bfqg_to_blkg(bfqg)->q->queue_lock); + + if (unlikely(!parent)) + return; + + bfqg_stats_merge(&parent->dead_stats, &bfqg->stats); + bfqg_stats_merge(&parent->dead_stats, &bfqg->dead_stats); + bfqg_stats_reset(&bfqg->stats); + bfqg_stats_reset(&bfqg->dead_stats); +} + +static void bfq_init_entity(struct bfq_entity *entity, + struct bfq_group *bfqg) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + + entity->weight = entity->new_weight; + entity->orig_weight = entity->new_weight; + if (bfqq) { + bfqq->ioprio = bfqq->new_ioprio; + bfqq->ioprio_class = bfqq->new_ioprio_class; + bfqg_get(bfqg); + } + entity->parent = bfqg->my_entity; + entity->sched_data = &bfqg->sched_data; +} + +static void bfqg_stats_exit(struct bfqg_stats *stats) +{ + blkg_rwstat_exit(&stats->service_bytes); + blkg_rwstat_exit(&stats->serviced); + blkg_rwstat_exit(&stats->merged); + blkg_rwstat_exit(&stats->service_time); + blkg_rwstat_exit(&stats->wait_time); + blkg_rwstat_exit(&stats->queued); + blkg_stat_exit(&stats->sectors); + blkg_stat_exit(&stats->time); + blkg_stat_exit(&stats->unaccounted_time); + blkg_stat_exit(&stats->avg_queue_size_sum); + blkg_stat_exit(&stats->avg_queue_size_samples); + blkg_stat_exit(&stats->dequeue); + blkg_stat_exit(&stats->group_wait_time); + blkg_stat_exit(&stats->idle_time); + blkg_stat_exit(&stats->empty_time); +} + +static int bfqg_stats_init(struct bfqg_stats *stats, gfp_t gfp) +{ + if (blkg_rwstat_init(&stats->service_bytes, gfp) || + blkg_rwstat_init(&stats->serviced, gfp) || + blkg_rwstat_init(&stats->merged, gfp) || + blkg_rwstat_init(&stats->service_time, gfp) || + blkg_rwstat_init(&stats->wait_time, gfp) || + blkg_rwstat_init(&stats->queued, gfp) || + blkg_stat_init(&stats->sectors, gfp) || + blkg_stat_init(&stats->time, gfp) || + blkg_stat_init(&stats->unaccounted_time, gfp) || + blkg_stat_init(&stats->avg_queue_size_sum, gfp) || + blkg_stat_init(&stats->avg_queue_size_samples, gfp) || + blkg_stat_init(&stats->dequeue, gfp) || + blkg_stat_init(&stats->group_wait_time, gfp) || + blkg_stat_init(&stats->idle_time, gfp) || + blkg_stat_init(&stats->empty_time, gfp)) { + bfqg_stats_exit(stats); + return -ENOMEM; + } + + return 0; +} + +static struct bfq_group_data *cpd_to_bfqgd(struct blkcg_policy_data *cpd) + { + return cpd ? container_of(cpd, struct bfq_group_data, pd) : NULL; + } + +static struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg) +{ + return cpd_to_bfqgd(blkcg_to_cpd(blkcg, &blkcg_policy_bfq)); +} + +static void bfq_cpd_init(struct blkcg_policy_data *cpd) +{ + struct bfq_group_data *d = cpd_to_bfqgd(cpd); + + d->weight = BFQ_DEFAULT_GRP_WEIGHT; +} + +static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node) +{ + struct bfq_group *bfqg; + + bfqg = kzalloc_node(sizeof(*bfqg), gfp, node); + if (!bfqg) + return NULL; + + if (bfqg_stats_init(&bfqg->stats, gfp) || + bfqg_stats_init(&bfqg->dead_stats, gfp)) { + kfree(bfqg); + return NULL; + } + + return &bfqg->pd; +} + +static void bfq_group_set_parent(struct bfq_group *bfqg, + struct bfq_group *parent) +{ + struct bfq_entity *entity; + + BUG_ON(!parent); + BUG_ON(!bfqg); + BUG_ON(bfqg == parent); + + entity = &bfqg->entity; + entity->parent = parent->my_entity; + entity->sched_data = &parent->sched_data; +} + +static void bfq_pd_init(struct blkg_policy_data *pd) +{ + struct blkcg_gq *blkg = pd_to_blkg(pd); + struct bfq_group *bfqg = blkg_to_bfqg(blkg); + struct bfq_data *bfqd = blkg->q->elevator->elevator_data; + struct bfq_entity *entity = &bfqg->entity; + struct bfq_group_data *d = blkcg_to_bfqgd(blkg->blkcg); + + entity->orig_weight = entity->weight = entity->new_weight = d->weight; + entity->my_sched_data = &bfqg->sched_data; + bfqg->my_entity = entity; /* + * the root_group's will be set to NULL + * in bfq_init_queue() + */ + bfqg->bfqd = bfqd; + bfqg->active_entities = 0; + bfqg->rq_pos_tree = RB_ROOT; +} + +static void bfq_pd_free(struct blkg_policy_data *pd) +{ + struct bfq_group *bfqg = pd_to_bfqg(pd); + + bfqg_stats_exit(&bfqg->stats); + bfqg_stats_exit(&bfqg->dead_stats); + + return kfree(bfqg); +} + +/* offset delta from bfqg->stats to bfqg->dead_stats */ +static const int dead_stats_off_delta = offsetof(struct bfq_group, dead_stats) - + offsetof(struct bfq_group, stats); + +/* to be used by recursive prfill, sums live and dead stats recursively */ +static u64 bfqg_stat_pd_recursive_sum(struct blkg_policy_data *pd, int off) +{ + u64 sum = 0; + + sum += blkg_stat_recursive_sum(pd_to_blkg(pd), &blkcg_policy_bfq, off); + sum += blkg_stat_recursive_sum(pd_to_blkg(pd), &blkcg_policy_bfq, + off + dead_stats_off_delta); + return sum; +} + +/* to be used by recursive prfill, sums live and dead rwstats recursively */ +static struct blkg_rwstat bfqg_rwstat_pd_recursive_sum(struct blkg_policy_data *pd, + int off) +{ + struct blkg_rwstat a, b; + + a = blkg_rwstat_recursive_sum(pd_to_blkg(pd), &blkcg_policy_bfq, off); + b = blkg_rwstat_recursive_sum(pd_to_blkg(pd), &blkcg_policy_bfq, + off + dead_stats_off_delta); + blkg_rwstat_add_aux(&a, &b); + return a; +} + +static void bfq_pd_reset_stats(struct blkg_policy_data *pd) +{ + struct bfq_group *bfqg = pd_to_bfqg(pd); + + bfqg_stats_reset(&bfqg->stats); + bfqg_stats_reset(&bfqg->dead_stats); +} + +static struct bfq_group *bfq_find_alloc_group(struct bfq_data *bfqd, + struct blkcg *blkcg) +{ + struct request_queue *q = bfqd->queue; + struct bfq_group *bfqg = NULL, *parent; + struct bfq_entity *entity = NULL; + + assert_spin_locked(bfqd->queue->queue_lock); + + /* avoid lookup for the common case where there's no blkcg */ + if (blkcg == &blkcg_root) { + bfqg = bfqd->root_group; + } else { + struct blkcg_gq *blkg; + + blkg = blkg_lookup_create(blkcg, q); + if (!IS_ERR(blkg)) + bfqg = blkg_to_bfqg(blkg); + else /* fallback to root_group */ + bfqg = bfqd->root_group; + } + + BUG_ON(!bfqg); + + /* + * Update chain of bfq_groups as we might be handling a leaf group + * which, along with some of its relatives, has not been hooked yet + * to the private hierarchy of BFQ. + */ + entity = &bfqg->entity; + for_each_entity(entity) { + bfqg = container_of(entity, struct bfq_group, entity); + BUG_ON(!bfqg); + if (bfqg != bfqd->root_group) { + parent = bfqg_parent(bfqg); + if (!parent) + parent = bfqd->root_group; + BUG_ON(!parent); + bfq_group_set_parent(bfqg, parent); + } + } + + return bfqg; +} + +static void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq); + +/** + * bfq_bfqq_move - migrate @bfqq to @bfqg. + * @bfqd: queue descriptor. + * @bfqq: the queue to move. + * @entity: @bfqq's entity. + * @bfqg: the group to move to. + * + * Move @bfqq to @bfqg, deactivating it from its old group and reactivating + * it on the new one. Avoid putting the entity on the old group idle tree. + * + * Must be called under the queue lock; the cgroup owning @bfqg must + * not disappear (by now this just means that we are called under + * rcu_read_lock()). + */ +static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq, + struct bfq_entity *entity, struct bfq_group *bfqg) +{ + int busy, resume; + + busy = bfq_bfqq_busy(bfqq); + resume = !RB_EMPTY_ROOT(&bfqq->sort_list); + + BUG_ON(resume && !entity->on_st); + BUG_ON(busy && !resume && entity->on_st && + bfqq != bfqd->in_service_queue); + + if (busy) { + BUG_ON(atomic_read(&bfqq->ref) < 2); + + if (!resume) + bfq_del_bfqq_busy(bfqd, bfqq, 0); + else + bfq_deactivate_bfqq(bfqd, bfqq, 0); + } else if (entity->on_st) + bfq_put_idle_entity(bfq_entity_service_tree(entity), entity); + bfqg_put(bfqq_group(bfqq)); + + /* + * Here we use a reference to bfqg. We don't need a refcounter + * as the cgroup reference will not be dropped, so that its + * destroy() callback will not be invoked. + */ + entity->parent = bfqg->my_entity; + entity->sched_data = &bfqg->sched_data; + bfqg_get(bfqg); + + if (busy) { + bfq_pos_tree_add_move(bfqd, bfqq); + if (resume) + bfq_activate_bfqq(bfqd, bfqq); + } + + if (!bfqd->in_service_queue && !bfqd->rq_in_driver) + bfq_schedule_dispatch(bfqd); +} + +/** + * __bfq_bic_change_cgroup - move @bic to @cgroup. + * @bfqd: the queue descriptor. + * @bic: the bic to move. + * @blkcg: the blk-cgroup to move to. + * + * Move bic to blkcg, assuming that bfqd->queue is locked; the caller + * has to make sure that the reference to cgroup is valid across the call. + * + * NOTE: an alternative approach might have been to store the current + * cgroup in bfqq and getting a reference to it, reducing the lookup + * time here, at the price of slightly more complex code. + */ +static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd, + struct bfq_io_cq *bic, + struct blkcg *blkcg) +{ + struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0); + struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1); + struct bfq_group *bfqg; + struct bfq_entity *entity; + + lockdep_assert_held(bfqd->queue->queue_lock); + + bfqg = bfq_find_alloc_group(bfqd, blkcg); + if (async_bfqq) { + entity = &async_bfqq->entity; + + if (entity->sched_data != &bfqg->sched_data) { + bic_set_bfqq(bic, NULL, 0); + bfq_log_bfqq(bfqd, async_bfqq, + "bic_change_group: %p %d", + async_bfqq, atomic_read(&async_bfqq->ref)); + bfq_put_queue(async_bfqq); + } + } + + if (sync_bfqq) { + entity = &sync_bfqq->entity; + if (entity->sched_data != &bfqg->sched_data) + bfq_bfqq_move(bfqd, sync_bfqq, entity, bfqg); + } + + return bfqg; +} + +static void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio) +{ + struct bfq_data *bfqd = bic_to_bfqd(bic); + struct blkcg *blkcg; + struct bfq_group *bfqg = NULL; + uint64_t id; + + rcu_read_lock(); + blkcg = bio_blkcg(bio); + id = blkcg->css.serial_nr; + rcu_read_unlock(); + + /* + * Check whether blkcg has changed. The condition may trigger + * spuriously on a newly created cic but there's no harm. + */ + if (unlikely(!bfqd) || likely(bic->blkcg_id == id)) + return; + + bfqg = __bfq_bic_change_cgroup(bfqd, bic, blkcg); + BUG_ON(!bfqg); + bic->blkcg_id = id; +} + +/** + * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st. + * @st: the service tree being flushed. + */ +static void bfq_flush_idle_tree(struct bfq_service_tree *st) +{ + struct bfq_entity *entity = st->first_idle; + + for (; entity ; entity = st->first_idle) + __bfq_deactivate_entity(entity, 0); +} + +/** + * bfq_reparent_leaf_entity - move leaf entity to the root_group. + * @bfqd: the device data structure with the root group. + * @entity: the entity to move. + */ +static void bfq_reparent_leaf_entity(struct bfq_data *bfqd, + struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + + BUG_ON(!bfqq); + bfq_bfqq_move(bfqd, bfqq, entity, bfqd->root_group); + return; +} + +/** + * bfq_reparent_active_entities - move to the root group all active + * entities. + * @bfqd: the device data structure with the root group. + * @bfqg: the group to move from. + * @st: the service tree with the entities. + * + * Needs queue_lock to be taken and reference to be valid over the call. + */ +static void bfq_reparent_active_entities(struct bfq_data *bfqd, + struct bfq_group *bfqg, + struct bfq_service_tree *st) +{ + struct rb_root *active = &st->active; + struct bfq_entity *entity = NULL; + + if (!RB_EMPTY_ROOT(&st->active)) + entity = bfq_entity_of(rb_first(active)); + + for (; entity ; entity = bfq_entity_of(rb_first(active))) + bfq_reparent_leaf_entity(bfqd, entity); + + if (bfqg->sched_data.in_service_entity) + bfq_reparent_leaf_entity(bfqd, + bfqg->sched_data.in_service_entity); + + return; +} + +/** + * bfq_destroy_group - destroy @bfqg. + * @bfqg: the group being destroyed. + * + * Destroy @bfqg, making sure that it is not referenced from its parent. + * blkio already grabs the queue_lock for us, so no need to use RCU-based magic + */ +static void bfq_pd_offline(struct blkg_policy_data *pd) +{ + struct bfq_service_tree *st; + struct bfq_group *bfqg; + struct bfq_data *bfqd; + struct bfq_entity *entity; + int i; + + BUG_ON(!pd); + bfqg = pd_to_bfqg(pd); + BUG_ON(!bfqg); + bfqd = bfqg->bfqd; + BUG_ON(bfqd && !bfqd->root_group); + + entity = bfqg->my_entity; + + if (!entity) /* root group */ + return; + + /* + * Empty all service_trees belonging to this group before + * deactivating the group itself. + */ + for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) { + BUG_ON(!bfqg->sched_data.service_tree); + st = bfqg->sched_data.service_tree + i; + /* + * The idle tree may still contain bfq_queues belonging + * to exited task because they never migrated to a different + * cgroup from the one being destroyed now. No one else + * can access them so it's safe to act without any lock. + */ + bfq_flush_idle_tree(st); + + /* + * It may happen that some queues are still active + * (busy) upon group destruction (if the corresponding + * processes have been forced to terminate). We move + * all the leaf entities corresponding to these queues + * to the root_group. + * Also, it may happen that the group has an entity + * in service, which is disconnected from the active + * tree: it must be moved, too. + * There is no need to put the sync queues, as the + * scheduler has taken no reference. + */ + bfq_reparent_active_entities(bfqd, bfqg, st); + BUG_ON(!RB_EMPTY_ROOT(&st->active)); + BUG_ON(!RB_EMPTY_ROOT(&st->idle)); + } + BUG_ON(bfqg->sched_data.next_in_service); + BUG_ON(bfqg->sched_data.in_service_entity); + + __bfq_deactivate_entity(entity, 0); + bfq_put_async_queues(bfqd, bfqg); + BUG_ON(entity->tree); + + bfqg_stats_xfer_dead(bfqg); +} + +static void bfq_end_wr_async(struct bfq_data *bfqd) +{ + struct blkcg_gq *blkg; + + list_for_each_entry(blkg, &bfqd->queue->blkg_list, q_node) { + struct bfq_group *bfqg = blkg_to_bfqg(blkg); + + bfq_end_wr_async_queues(bfqd, bfqg); + } + bfq_end_wr_async_queues(bfqd, bfqd->root_group); +} + +static u64 bfqio_cgroup_weight_read(struct cgroup_subsys_state *css, + struct cftype *cftype) +{ + struct blkcg *blkcg = css_to_blkcg(css); + struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg); + int ret = -EINVAL; + + spin_lock_irq(&blkcg->lock); + ret = bfqgd->weight; + spin_unlock_irq(&blkcg->lock); + + return ret; +} + +static int bfqio_cgroup_weight_read_dfl(struct seq_file *sf, void *v) +{ + struct blkcg *blkcg = css_to_blkcg(seq_css(sf)); + struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg); + + spin_lock_irq(&blkcg->lock); + seq_printf(sf, "%u\n", bfqgd->weight); + spin_unlock_irq(&blkcg->lock); + + return 0; +} + +static int bfqio_cgroup_weight_write(struct cgroup_subsys_state *css, + struct cftype *cftype, + u64 val) +{ + struct blkcg *blkcg = css_to_blkcg(css); + struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg); + struct blkcg_gq *blkg; + int ret = -EINVAL; + + if (val < BFQ_MIN_WEIGHT || val > BFQ_MAX_WEIGHT) + return ret; + + ret = 0; + spin_lock_irq(&blkcg->lock); + bfqgd->weight = (unsigned short)val; + hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) { + struct bfq_group *bfqg = blkg_to_bfqg(blkg); + if (!bfqg) + continue; + /* + * Setting the prio_changed flag of the entity + * to 1 with new_weight == weight would re-set + * the value of the weight to its ioprio mapping. + * Set the flag only if necessary. + */ + if ((unsigned short)val != bfqg->entity.new_weight) { + bfqg->entity.new_weight = (unsigned short)val; + /* + * Make sure that the above new value has been + * stored in bfqg->entity.new_weight before + * setting the prio_changed flag. In fact, + * this flag may be read asynchronously (in + * critical sections protected by a different + * lock than that held here), and finding this + * flag set may cause the execution of the code + * for updating parameters whose value may + * depend also on bfqg->entity.new_weight (in + * __bfq_entity_update_weight_prio). + * This barrier makes sure that the new value + * of bfqg->entity.new_weight is correctly + * seen in that code. + */ + smp_wmb(); + bfqg->entity.prio_changed = 1; + } + } + spin_unlock_irq(&blkcg->lock); + + return ret; +} + +static ssize_t bfqio_cgroup_weight_write_dfl(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + /* First unsigned long found in the file is used */ + return bfqio_cgroup_weight_write(of_css(of), NULL, + simple_strtoull(strim(buf), NULL, 0)); +} + +static int bfqg_print_stat(struct seq_file *sf, void *v) +{ + blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), blkg_prfill_stat, + &blkcg_policy_bfq, seq_cft(sf)->private, false); + return 0; +} + +static int bfqg_print_rwstat(struct seq_file *sf, void *v) +{ + blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), blkg_prfill_rwstat, + &blkcg_policy_bfq, seq_cft(sf)->private, true); + return 0; +} + +static u64 bfqg_prfill_stat_recursive(struct seq_file *sf, + struct blkg_policy_data *pd, int off) +{ + u64 sum = bfqg_stat_pd_recursive_sum(pd, off); + + return __blkg_prfill_u64(sf, pd, sum); +} + +static u64 bfqg_prfill_rwstat_recursive(struct seq_file *sf, + struct blkg_policy_data *pd, int off) +{ + struct blkg_rwstat sum = bfqg_rwstat_pd_recursive_sum(pd, off); + + return __blkg_prfill_rwstat(sf, pd, &sum); +} + +static int bfqg_print_stat_recursive(struct seq_file *sf, void *v) +{ + blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), + bfqg_prfill_stat_recursive, &blkcg_policy_bfq, + seq_cft(sf)->private, false); + return 0; +} + +static int bfqg_print_rwstat_recursive(struct seq_file *sf, void *v) +{ + blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), + bfqg_prfill_rwstat_recursive, &blkcg_policy_bfq, + seq_cft(sf)->private, true); + return 0; +} + +static u64 bfqg_prfill_avg_queue_size(struct seq_file *sf, + struct blkg_policy_data *pd, int off) +{ + struct bfq_group *bfqg = pd_to_bfqg(pd); + u64 samples = blkg_stat_read(&bfqg->stats.avg_queue_size_samples); + u64 v = 0; + + if (samples) { + v = blkg_stat_read(&bfqg->stats.avg_queue_size_sum); + v = div64_u64(v, samples); + } + __blkg_prfill_u64(sf, pd, v); + return 0; +} + +/* print avg_queue_size */ +static int bfqg_print_avg_queue_size(struct seq_file *sf, void *v) +{ + blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), + bfqg_prfill_avg_queue_size, &blkcg_policy_bfq, + 0, false); + return 0; +} + +static struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node) +{ + int ret; + + ret = blkcg_activate_policy(bfqd->queue, &blkcg_policy_bfq); + if (ret) + return NULL; + + return blkg_to_bfqg(bfqd->queue->root_blkg); +} + +static struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp) +{ + struct bfq_group_data *bgd; + + bgd = kzalloc(sizeof(*bgd), GFP_KERNEL); + if (!bgd) + return NULL; + return &bgd->pd; +} + +static void bfq_cpd_free(struct blkcg_policy_data *cpd) +{ + kfree(cpd_to_bfqgd(cpd)); +} + +static struct cftype bfqio_files_dfl[] = { + { + .name = "weight", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = bfqio_cgroup_weight_read_dfl, + .write = bfqio_cgroup_weight_write_dfl, + }, + {} /* terminate */ +}; + +static struct cftype bfqio_files[] = { + { + .name = "bfq.weight", + .read_u64 = bfqio_cgroup_weight_read, + .write_u64 = bfqio_cgroup_weight_write, + }, + /* statistics, cover only the tasks in the bfqg */ + { + .name = "bfq.time", + .private = offsetof(struct bfq_group, stats.time), + .seq_show = bfqg_print_stat, + }, + { + .name = "bfq.sectors", + .private = offsetof(struct bfq_group, stats.sectors), + .seq_show = bfqg_print_stat, + }, + { + .name = "bfq.io_service_bytes", + .private = offsetof(struct bfq_group, stats.service_bytes), + .seq_show = bfqg_print_rwstat, + }, + { + .name = "bfq.io_serviced", + .private = offsetof(struct bfq_group, stats.serviced), + .seq_show = bfqg_print_rwstat, + }, + { + .name = "bfq.io_service_time", + .private = offsetof(struct bfq_group, stats.service_time), + .seq_show = bfqg_print_rwstat, + }, + { + .name = "bfq.io_wait_time", + .private = offsetof(struct bfq_group, stats.wait_time), + .seq_show = bfqg_print_rwstat, + }, + { + .name = "bfq.io_merged", + .private = offsetof(struct bfq_group, stats.merged), + .seq_show = bfqg_print_rwstat, + }, + { + .name = "bfq.io_queued", + .private = offsetof(struct bfq_group, stats.queued), + .seq_show = bfqg_print_rwstat, + }, + + /* the same statictics which cover the bfqg and its descendants */ + { + .name = "bfq.time_recursive", + .private = offsetof(struct bfq_group, stats.time), + .seq_show = bfqg_print_stat_recursive, + }, + { + .name = "bfq.sectors_recursive", + .private = offsetof(struct bfq_group, stats.sectors), + .seq_show = bfqg_print_stat_recursive, + }, + { + .name = "bfq.io_service_bytes_recursive", + .private = offsetof(struct bfq_group, stats.service_bytes), + .seq_show = bfqg_print_rwstat_recursive, + }, + { + .name = "bfq.io_serviced_recursive", + .private = offsetof(struct bfq_group, stats.serviced), + .seq_show = bfqg_print_rwstat_recursive, + }, + { + .name = "bfq.io_service_time_recursive", + .private = offsetof(struct bfq_group, stats.service_time), + .seq_show = bfqg_print_rwstat_recursive, + }, + { + .name = "bfq.io_wait_time_recursive", + .private = offsetof(struct bfq_group, stats.wait_time), + .seq_show = bfqg_print_rwstat_recursive, + }, + { + .name = "bfq.io_merged_recursive", + .private = offsetof(struct bfq_group, stats.merged), + .seq_show = bfqg_print_rwstat_recursive, + }, + { + .name = "bfq.io_queued_recursive", + .private = offsetof(struct bfq_group, stats.queued), + .seq_show = bfqg_print_rwstat_recursive, + }, + { + .name = "bfq.avg_queue_size", + .seq_show = bfqg_print_avg_queue_size, + }, + { + .name = "bfq.group_wait_time", + .private = offsetof(struct bfq_group, stats.group_wait_time), + .seq_show = bfqg_print_stat, + }, + { + .name = "bfq.idle_time", + .private = offsetof(struct bfq_group, stats.idle_time), + .seq_show = bfqg_print_stat, + }, + { + .name = "bfq.empty_time", + .private = offsetof(struct bfq_group, stats.empty_time), + .seq_show = bfqg_print_stat, + }, + { + .name = "bfq.dequeue", + .private = offsetof(struct bfq_group, stats.dequeue), + .seq_show = bfqg_print_stat, + }, + { + .name = "bfq.unaccounted_time", + .private = offsetof(struct bfq_group, stats.unaccounted_time), + .seq_show = bfqg_print_stat, + }, + { } /* terminate */ +}; + +static struct blkcg_policy blkcg_policy_bfq = { + .dfl_cftypes = bfqio_files_dfl, + .legacy_cftypes = bfqio_files, + + .pd_alloc_fn = bfq_pd_alloc, + .pd_init_fn = bfq_pd_init, + .pd_offline_fn = bfq_pd_offline, + .pd_free_fn = bfq_pd_free, + .pd_reset_stats_fn = bfq_pd_reset_stats, + + .cpd_alloc_fn = bfq_cpd_alloc, + .cpd_init_fn = bfq_cpd_init, + .cpd_bind_fn = bfq_cpd_init, + .cpd_free_fn = bfq_cpd_free, + +}; + +#else + +static void bfq_init_entity(struct bfq_entity *entity, + struct bfq_group *bfqg) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + entity->weight = entity->new_weight; + entity->orig_weight = entity->new_weight; + if (bfqq) { + bfqq->ioprio = bfqq->new_ioprio; + bfqq->ioprio_class = bfqq->new_ioprio_class; + } + entity->sched_data = &bfqg->sched_data; +} + +static struct bfq_group * +bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio) +{ + struct bfq_data *bfqd = bic_to_bfqd(bic); + return bfqd->root_group; +} + +static void bfq_bfqq_move(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + struct bfq_entity *entity, + struct bfq_group *bfqg) +{ +} + +static void bfq_end_wr_async(struct bfq_data *bfqd) +{ + bfq_end_wr_async_queues(bfqd, bfqd->root_group); +} + +static void bfq_disconnect_groups(struct bfq_data *bfqd) +{ + bfq_put_async_queues(bfqd, bfqd->root_group); +} + +static struct bfq_group *bfq_find_alloc_group(struct bfq_data *bfqd, + struct blkcg *blkcg) +{ + return bfqd->root_group; +} + +static struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node) +{ + struct bfq_group *bfqg; + int i; + + bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node); + if (!bfqg) + return NULL; + + for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) + bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT; + + return bfqg; +} +#endif diff --git b/block/bfq-ioc.c b/block/bfq-ioc.c new file mode 100644 index 0000000..fb7bb8f --- /dev/null +++ b/block/bfq-ioc.c @@ -0,0 +1,36 @@ +/* + * BFQ: I/O context handling. + * + * Based on ideas and code from CFQ: + * Copyright (C) 2003 Jens Axboe + * + * Copyright (C) 2008 Fabio Checconi + * Paolo Valente + * + * Copyright (C) 2010 Paolo Valente + */ + +/** + * icq_to_bic - convert iocontext queue structure to bfq_io_cq. + * @icq: the iocontext queue. + */ +static struct bfq_io_cq *icq_to_bic(struct io_cq *icq) +{ + /* bic->icq is the first member, %NULL will convert to %NULL */ + return container_of(icq, struct bfq_io_cq, icq); +} + +/** + * bfq_bic_lookup - search into @ioc a bic associated to @bfqd. + * @bfqd: the lookup key. + * @ioc: the io_context of the process doing I/O. + * + * Queue lock must be held. + */ +static struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd, + struct io_context *ioc) +{ + if (ioc) + return icq_to_bic(ioc_lookup_icq(ioc, bfqd->queue)); + return NULL; +} diff --git b/block/bfq-iosched.c b/block/bfq-iosched.c new file mode 100644 index 0000000..d1f648d --- /dev/null +++ b/block/bfq-iosched.c @@ -0,0 +1,4413 @@ +/* + * Budget Fair Queueing (BFQ) disk scheduler. + * + * Based on ideas and code from CFQ: + * Copyright (C) 2003 Jens Axboe + * + * Copyright (C) 2008 Fabio Checconi + * Paolo Valente + * + * Copyright (C) 2010 Paolo Valente + * + * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ + * file. + * + * BFQ is a proportional-share storage-I/O scheduling algorithm based on + * the slice-by-slice service scheme of CFQ. But BFQ assigns budgets, + * measured in number of sectors, to processes instead of time slices. The + * device is not granted to the in-service process for a given time slice, + * but until it has exhausted its assigned budget. This change from the time + * to the service domain allows BFQ to distribute the device throughput + * among processes as desired, without any distortion due to ZBR, workload + * fluctuations or other factors. BFQ uses an ad hoc internal scheduler, + * called B-WF2Q+, to schedule processes according to their budgets. More + * precisely, BFQ schedules queues associated to processes. Thanks to the + * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to + * I/O-bound processes issuing sequential requests (to boost the + * throughput), and yet guarantee a low latency to interactive and soft + * real-time applications. + * + * BFQ is described in [1], where also a reference to the initial, more + * theoretical paper on BFQ can be found. The interested reader can find + * in the latter paper full details on the main algorithm, as well as + * formulas of the guarantees and formal proofs of all the properties. + * With respect to the version of BFQ presented in these papers, this + * implementation adds a few more heuristics, such as the one that + * guarantees a low latency to soft real-time applications, and a + * hierarchical extension based on H-WF2Q+. + * + * B-WF2Q+ is based on WF2Q+, that is described in [2], together with + * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N) + * complexity derives from the one introduced with EEVDF in [3]. + * + * [1] P. Valente and M. Andreolini, ``Improving Application Responsiveness + * with the BFQ Disk I/O Scheduler'', + * Proceedings of the 5th Annual International Systems and Storage + * Conference (SYSTOR '12), June 2012. + * + * http://algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf + * + * [2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing + * Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689, + * Oct 1997. + * + * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz + * + * [3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline + * First: A Flexible and Accurate Mechanism for Proportional Share + * Resource Allocation,'' technical report. + * + * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include "bfq.h" +#include "blk.h" + +/* Expiration time of sync (0) and async (1) requests, in jiffies. */ +static const int bfq_fifo_expire[2] = { HZ / 4, HZ / 8 }; + +/* Maximum backwards seek, in KiB. */ +static const int bfq_back_max = 16 * 1024; + +/* Penalty of a backwards seek, in number of sectors. */ +static const int bfq_back_penalty = 2; + +/* Idling period duration, in jiffies. */ +static int bfq_slice_idle = HZ / 125; + +/* Minimum number of assigned budgets for which stats are safe to compute. */ +static const int bfq_stats_min_budgets = 194; + +/* Default maximum budget values, in sectors and number of requests. */ +static const int bfq_default_max_budget = 16 * 1024; +static const int bfq_max_budget_async_rq = 4; + +/* + * Async to sync throughput distribution is controlled as follows: + * when an async request is served, the entity is charged the number + * of sectors of the request, multiplied by the factor below + */ +static const int bfq_async_charge_factor = 10; + +/* Default timeout values, in jiffies, approximating CFQ defaults. */ +static const int bfq_timeout_sync = HZ / 8; +static int bfq_timeout_async = HZ / 25; + +struct kmem_cache *bfq_pool; + +/* Below this threshold (in ms), we consider thinktime immediate. */ +#define BFQ_MIN_TT 2 + +/* hw_tag detection: parallel requests threshold and min samples needed. */ +#define BFQ_HW_QUEUE_THRESHOLD 4 +#define BFQ_HW_QUEUE_SAMPLES 32 + +#define BFQQ_SEEK_THR (sector_t)(8 * 1024) +#define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR) + +/* Min samples used for peak rate estimation (for autotuning). */ +#define BFQ_PEAK_RATE_SAMPLES 32 + +/* Shift used for peak rate fixed precision calculations. */ +#define BFQ_RATE_SHIFT 16 + +/* + * By default, BFQ computes the duration of the weight raising for + * interactive applications automatically, using the following formula: + * duration = (R / r) * T, where r is the peak rate of the device, and + * R and T are two reference parameters. + * In particular, R is the peak rate of the reference device (see below), + * and T is a reference time: given the systems that are likely to be + * installed on the reference device according to its speed class, T is + * about the maximum time needed, under BFQ and while reading two files in + * parallel, to load typical large applications on these systems. + * In practice, the slower/faster the device at hand is, the more/less it + * takes to load applications with respect to the reference device. + * Accordingly, the longer/shorter BFQ grants weight raising to interactive + * applications. + * + * BFQ uses four different reference pairs (R, T), depending on: + * . whether the device is rotational or non-rotational; + * . whether the device is slow, such as old or portable HDDs, as well as + * SD cards, or fast, such as newer HDDs and SSDs. + * + * The device's speed class is dynamically (re)detected in + * bfq_update_peak_rate() every time the estimated peak rate is updated. + * + * In the following definitions, R_slow[0]/R_fast[0] and T_slow[0]/T_fast[0] + * are the reference values for a slow/fast rotational device, whereas + * R_slow[1]/R_fast[1] and T_slow[1]/T_fast[1] are the reference values for + * a slow/fast non-rotational device. Finally, device_speed_thresh are the + * thresholds used to switch between speed classes. + * Both the reference peak rates and the thresholds are measured in + * sectors/usec, left-shifted by BFQ_RATE_SHIFT. + */ +static int R_slow[2] = {1536, 10752}; +static int R_fast[2] = {17415, 34791}; +/* + * To improve readability, a conversion function is used to initialize the + * following arrays, which entails that they can be initialized only in a + * function. + */ +static int T_slow[2]; +static int T_fast[2]; +static int device_speed_thresh[2]; + +#define BFQ_SERVICE_TREE_INIT ((struct bfq_service_tree) \ + { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 }) + +#define RQ_BIC(rq) ((struct bfq_io_cq *) (rq)->elv.priv[0]) +#define RQ_BFQQ(rq) ((rq)->elv.priv[1]) + +static void bfq_schedule_dispatch(struct bfq_data *bfqd); + +#include "bfq-ioc.c" +#include "bfq-sched.c" +#include "bfq-cgroup.c" + +#define bfq_class_idle(bfqq) ((bfqq)->ioprio_class == IOPRIO_CLASS_IDLE) +#define bfq_class_rt(bfqq) ((bfqq)->ioprio_class == IOPRIO_CLASS_RT) + +#define bfq_sample_valid(samples) ((samples) > 80) + +/* + * We regard a request as SYNC, if either it's a read or has the SYNC bit + * set (in which case it could also be a direct WRITE). + */ +static int bfq_bio_sync(struct bio *bio) +{ + if (bio_data_dir(bio) == READ || (bio->bi_rw & REQ_SYNC)) + return 1; + + return 0; +} + +/* + * Scheduler run of queue, if there are requests pending and no one in the + * driver that will restart queueing. + */ +static void bfq_schedule_dispatch(struct bfq_data *bfqd) +{ + if (bfqd->queued != 0) { + bfq_log(bfqd, "schedule dispatch"); + kblockd_schedule_work(&bfqd->unplug_work); + } +} + +/* + * Lifted from AS - choose which of rq1 and rq2 that is best served now. + * We choose the request that is closesr to the head right now. Distance + * behind the head is penalized and only allowed to a certain extent. + */ +static struct request *bfq_choose_req(struct bfq_data *bfqd, + struct request *rq1, + struct request *rq2, + sector_t last) +{ + sector_t s1, s2, d1 = 0, d2 = 0; + unsigned long back_max; +#define BFQ_RQ1_WRAP 0x01 /* request 1 wraps */ +#define BFQ_RQ2_WRAP 0x02 /* request 2 wraps */ + unsigned wrap = 0; /* bit mask: requests behind the disk head? */ + + if (!rq1 || rq1 == rq2) + return rq2; + if (!rq2) + return rq1; + + if (rq_is_sync(rq1) && !rq_is_sync(rq2)) + return rq1; + else if (rq_is_sync(rq2) && !rq_is_sync(rq1)) + return rq2; + if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META)) + return rq1; + else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META)) + return rq2; + + s1 = blk_rq_pos(rq1); + s2 = blk_rq_pos(rq2); + + /* + * By definition, 1KiB is 2 sectors. + */ + back_max = bfqd->bfq_back_max * 2; + + /* + * Strict one way elevator _except_ in the case where we allow + * short backward seeks which are biased as twice the cost of a + * similar forward seek. + */ + if (s1 >= last) + d1 = s1 - last; + else if (s1 + back_max >= last) + d1 = (last - s1) * bfqd->bfq_back_penalty; + else + wrap |= BFQ_RQ1_WRAP; + + if (s2 >= last) + d2 = s2 - last; + else if (s2 + back_max >= last) + d2 = (last - s2) * bfqd->bfq_back_penalty; + else + wrap |= BFQ_RQ2_WRAP; + + /* Found required data */ + + /* + * By doing switch() on the bit mask "wrap" we avoid having to + * check two variables for all permutations: --> faster! + */ + switch (wrap) { + case 0: /* common case for CFQ: rq1 and rq2 not wrapped */ + if (d1 < d2) + return rq1; + else if (d2 < d1) + return rq2; + else { + if (s1 >= s2) + return rq1; + else + return rq2; + } + + case BFQ_RQ2_WRAP: + return rq1; + case BFQ_RQ1_WRAP: + return rq2; + case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */ + default: + /* + * Since both rqs are wrapped, + * start with the one that's further behind head + * (--> only *one* back seek required), + * since back seek takes more time than forward. + */ + if (s1 <= s2) + return rq1; + else + return rq2; + } +} + +static struct bfq_queue * +bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root, + sector_t sector, struct rb_node **ret_parent, + struct rb_node ***rb_link) +{ + struct rb_node **p, *parent; + struct bfq_queue *bfqq = NULL; + + parent = NULL; + p = &root->rb_node; + while (*p) { + struct rb_node **n; + + parent = *p; + bfqq = rb_entry(parent, struct bfq_queue, pos_node); + + /* + * Sort strictly based on sector. Smallest to the left, + * largest to the right. + */ + if (sector > blk_rq_pos(bfqq->next_rq)) + n = &(*p)->rb_right; + else if (sector < blk_rq_pos(bfqq->next_rq)) + n = &(*p)->rb_left; + else + break; + p = n; + bfqq = NULL; + } + + *ret_parent = parent; + if (rb_link) + *rb_link = p; + + bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d", + (long long unsigned)sector, + bfqq ? bfqq->pid : 0); + + return bfqq; +} + +static void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + struct rb_node **p, *parent; + struct bfq_queue *__bfqq; + + if (bfqq->pos_root) { + rb_erase(&bfqq->pos_node, bfqq->pos_root); + bfqq->pos_root = NULL; + } + + if (bfq_class_idle(bfqq)) + return; + if (!bfqq->next_rq) + return; + + bfqq->pos_root = &bfq_bfqq_to_bfqg(bfqq)->rq_pos_tree; + __bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root, + blk_rq_pos(bfqq->next_rq), &parent, &p); + if (!__bfqq) { + rb_link_node(&bfqq->pos_node, parent, p); + rb_insert_color(&bfqq->pos_node, bfqq->pos_root); + } else + bfqq->pos_root = NULL; +} + +/* + * Tell whether there are active queues or groups with differentiated weights. + */ +static bool bfq_differentiated_weights(struct bfq_data *bfqd) +{ + /* + * For weights to differ, at least one of the trees must contain + * at least two nodes. + */ + return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) && + (bfqd->queue_weights_tree.rb_node->rb_left || + bfqd->queue_weights_tree.rb_node->rb_right) +#ifdef CONFIG_BFQ_GROUP_IOSCHED + ) || + (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) && + (bfqd->group_weights_tree.rb_node->rb_left || + bfqd->group_weights_tree.rb_node->rb_right) +#endif + ); +} + +/* + * The following function returns true if every queue must receive the + * same share of the throughput (this condition is used when deciding + * whether idling may be disabled, see the comments in the function + * bfq_bfqq_may_idle()). + * + * Such a scenario occurs when: + * 1) all active queues have the same weight, + * 2) all active groups at the same level in the groups tree have the same + * weight, + * 3) all active groups at the same level in the groups tree have the same + * number of children. + * + * Unfortunately, keeping the necessary state for evaluating exactly the + * above symmetry conditions would be quite complex and time-consuming. + * Therefore this function evaluates, instead, the following stronger + * sub-conditions, for which it is much easier to maintain the needed + * state: + * 1) all active queues have the same weight, + * 2) all active groups have the same weight, + * 3) all active groups have at most one active child each. + * In particular, the last two conditions are always true if hierarchical + * support and the cgroups interface are not enabled, thus no state needs + * to be maintained in this case. + */ +static bool bfq_symmetric_scenario(struct bfq_data *bfqd) +{ + return +#ifdef CONFIG_BFQ_GROUP_IOSCHED + !bfqd->active_numerous_groups && +#endif + !bfq_differentiated_weights(bfqd); +} + +/* + * If the weight-counter tree passed as input contains no counter for + * the weight of the input entity, then add that counter; otherwise just + * increment the existing counter. + * + * Note that weight-counter trees contain few nodes in mostly symmetric + * scenarios. For example, if all queues have the same weight, then the + * weight-counter tree for the queues may contain at most one node. + * This holds even if low_latency is on, because weight-raised queues + * are not inserted in the tree. + * In most scenarios, the rate at which nodes are created/destroyed + * should be low too. + */ +static void bfq_weights_tree_add(struct bfq_data *bfqd, + struct bfq_entity *entity, + struct rb_root *root) +{ + struct rb_node **new = &(root->rb_node), *parent = NULL; + + /* + * Do not insert if the entity is already associated with a + * counter, which happens if: + * 1) the entity is associated with a queue, + * 2) a request arrival has caused the queue to become both + * non-weight-raised, and hence change its weight, and + * backlogged; in this respect, each of the two events + * causes an invocation of this function, + * 3) this is the invocation of this function caused by the + * second event. This second invocation is actually useless, + * and we handle this fact by exiting immediately. More + * efficient or clearer solutions might possibly be adopted. + */ + if (entity->weight_counter) + return; + + while (*new) { + struct bfq_weight_counter *__counter = container_of(*new, + struct bfq_weight_counter, + weights_node); + parent = *new; + + if (entity->weight == __counter->weight) { + entity->weight_counter = __counter; + goto inc_counter; + } + if (entity->weight < __counter->weight) + new = &((*new)->rb_left); + else + new = &((*new)->rb_right); + } + + entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter), + GFP_ATOMIC); + entity->weight_counter->weight = entity->weight; + rb_link_node(&entity->weight_counter->weights_node, parent, new); + rb_insert_color(&entity->weight_counter->weights_node, root); + +inc_counter: + entity->weight_counter->num_active++; +} + +/* + * Decrement the weight counter associated with the entity, and, if the + * counter reaches 0, remove the counter from the tree. + * See the comments to the function bfq_weights_tree_add() for considerations + * about overhead. + */ +static void bfq_weights_tree_remove(struct bfq_data *bfqd, + struct bfq_entity *entity, + struct rb_root *root) +{ + if (!entity->weight_counter) + return; + + BUG_ON(RB_EMPTY_ROOT(root)); + BUG_ON(entity->weight_counter->weight != entity->weight); + + BUG_ON(!entity->weight_counter->num_active); + entity->weight_counter->num_active--; + if (entity->weight_counter->num_active > 0) + goto reset_entity_pointer; + + rb_erase(&entity->weight_counter->weights_node, root); + kfree(entity->weight_counter); + +reset_entity_pointer: + entity->weight_counter = NULL; +} + +static struct request *bfq_find_next_rq(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + struct request *last) +{ + struct rb_node *rbnext = rb_next(&last->rb_node); + struct rb_node *rbprev = rb_prev(&last->rb_node); + struct request *next = NULL, *prev = NULL; + + BUG_ON(RB_EMPTY_NODE(&last->rb_node)); + + if (rbprev) + prev = rb_entry_rq(rbprev); + + if (rbnext) + next = rb_entry_rq(rbnext); + else { + rbnext = rb_first(&bfqq->sort_list); + if (rbnext && rbnext != &last->rb_node) + next = rb_entry_rq(rbnext); + } + + return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last)); +} + +/* see the definition of bfq_async_charge_factor for details */ +static unsigned long bfq_serv_to_charge(struct request *rq, + struct bfq_queue *bfqq) +{ + return blk_rq_sectors(rq) * + (1 + ((!bfq_bfqq_sync(bfqq)) * (bfqq->wr_coeff == 1) * + bfq_async_charge_factor)); +} + +/** + * bfq_updated_next_req - update the queue after a new next_rq selection. + * @bfqd: the device data the queue belongs to. + * @bfqq: the queue to update. + * + * If the first request of a queue changes we make sure that the queue + * has enough budget to serve at least its first request (if the + * request has grown). We do this because if the queue has not enough + * budget for its first request, it has to go through two dispatch + * rounds to actually get it dispatched. + */ +static void bfq_updated_next_req(struct bfq_data *bfqd, + struct bfq_queue *bfqq) +{ + struct bfq_entity *entity = &bfqq->entity; + struct bfq_service_tree *st = bfq_entity_service_tree(entity); + struct request *next_rq = bfqq->next_rq; + unsigned long new_budget; + + if (!next_rq) + return; + + if (bfqq == bfqd->in_service_queue) + /* + * In order not to break guarantees, budgets cannot be + * changed after an entity has been selected. + */ + return; + + BUG_ON(entity->tree != &st->active); + BUG_ON(entity == entity->sched_data->in_service_entity); + + new_budget = max_t(unsigned long, bfqq->max_budget, + bfq_serv_to_charge(next_rq, bfqq)); + if (entity->budget != new_budget) { + entity->budget = new_budget; + bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu", + new_budget); + bfq_activate_bfqq(bfqd, bfqq); + } +} + +static unsigned int bfq_wr_duration(struct bfq_data *bfqd) +{ + u64 dur; + + if (bfqd->bfq_wr_max_time > 0) + return bfqd->bfq_wr_max_time; + + dur = bfqd->RT_prod; + do_div(dur, bfqd->peak_rate); + + return dur; +} + +static unsigned bfq_bfqq_cooperations(struct bfq_queue *bfqq) +{ + return bfqq->bic ? bfqq->bic->cooperations : 0; +} + +static void +bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic) +{ + if (bic->saved_idle_window) + bfq_mark_bfqq_idle_window(bfqq); + else + bfq_clear_bfqq_idle_window(bfqq); + if (bic->saved_IO_bound) + bfq_mark_bfqq_IO_bound(bfqq); + else + bfq_clear_bfqq_IO_bound(bfqq); + /* Assuming that the flag in_large_burst is already correctly set */ + if (bic->wr_time_left && bfqq->bfqd->low_latency && + !bfq_bfqq_in_large_burst(bfqq) && + bic->cooperations < bfqq->bfqd->bfq_coop_thresh) { + /* + * Start a weight raising period with the duration given by + * the raising_time_left snapshot. + */ + if (bfq_bfqq_busy(bfqq)) + bfqq->bfqd->wr_busy_queues++; + bfqq->wr_coeff = bfqq->bfqd->bfq_wr_coeff; + bfqq->wr_cur_max_time = bic->wr_time_left; + bfqq->last_wr_start_finish = jiffies; + bfqq->entity.prio_changed = 1; + } + /* + * Clear wr_time_left to prevent bfq_bfqq_save_state() from + * getting confused about the queue's need of a weight-raising + * period. + */ + bic->wr_time_left = 0; +} + +static int bfqq_process_refs(struct bfq_queue *bfqq) +{ + int process_refs, io_refs; + + lockdep_assert_held(bfqq->bfqd->queue->queue_lock); + + io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE]; + process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st; + BUG_ON(process_refs < 0); + return process_refs; +} + +/* Empty burst list and add just bfqq (see comments to bfq_handle_burst) */ +static void bfq_reset_burst_list(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + struct bfq_queue *item; + struct hlist_node *n; + + hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node) + hlist_del_init(&item->burst_list_node); + hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list); + bfqd->burst_size = 1; +} + +/* Add bfqq to the list of queues in current burst (see bfq_handle_burst) */ +static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + /* Increment burst size to take into account also bfqq */ + bfqd->burst_size++; + + if (bfqd->burst_size == bfqd->bfq_large_burst_thresh) { + struct bfq_queue *pos, *bfqq_item; + struct hlist_node *n; + + /* + * Enough queues have been activated shortly after each + * other to consider this burst as large. + */ + bfqd->large_burst = true; + + /* + * We can now mark all queues in the burst list as + * belonging to a large burst. + */ + hlist_for_each_entry(bfqq_item, &bfqd->burst_list, + burst_list_node) + bfq_mark_bfqq_in_large_burst(bfqq_item); + bfq_mark_bfqq_in_large_burst(bfqq); + + /* + * From now on, and until the current burst finishes, any + * new queue being activated shortly after the last queue + * was inserted in the burst can be immediately marked as + * belonging to a large burst. So the burst list is not + * needed any more. Remove it. + */ + hlist_for_each_entry_safe(pos, n, &bfqd->burst_list, + burst_list_node) + hlist_del_init(&pos->burst_list_node); + } else /* burst not yet large: add bfqq to the burst list */ + hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list); +} + +/* + * If many queues happen to become active shortly after each other, then, + * to help the processes associated to these queues get their job done as + * soon as possible, it is usually better to not grant either weight-raising + * or device idling to these queues. In this comment we describe, firstly, + * the reasons why this fact holds, and, secondly, the next function, which + * implements the main steps needed to properly mark these queues so that + * they can then be treated in a different way. + * + * As for the terminology, we say that a queue becomes active, i.e., + * switches from idle to backlogged, either when it is created (as a + * consequence of the arrival of an I/O request), or, if already existing, + * when a new request for the queue arrives while the queue is idle. + * Bursts of activations, i.e., activations of different queues occurring + * shortly after each other, are typically caused by services or applications + * that spawn or reactivate many parallel threads/processes. Examples are + * systemd during boot or git grep. + * + * These services or applications benefit mostly from a high throughput: + * the quicker the requests of the activated queues are cumulatively served, + * the sooner the target job of these queues gets completed. As a consequence, + * weight-raising any of these queues, which also implies idling the device + * for it, is almost always counterproductive: in most cases it just lowers + * throughput. + * + * On the other hand, a burst of activations may be also caused by the start + * of an application that does not consist in a lot of parallel I/O-bound + * threads. In fact, with a complex application, the burst may be just a + * consequence of the fact that several processes need to be executed to + * start-up the application. To start an application as quickly as possible, + * the best thing to do is to privilege the I/O related to the application + * with respect to all other I/O. Therefore, the best strategy to start as + * quickly as possible an application that causes a burst of activations is + * to weight-raise all the queues activated during the burst. This is the + * exact opposite of the best strategy for the other type of bursts. + * + * In the end, to take the best action for each of the two cases, the two + * types of bursts need to be distinguished. Fortunately, this seems + * relatively easy to do, by looking at the sizes of the bursts. In + * particular, we found a threshold such that bursts with a larger size + * than that threshold are apparently caused only by services or commands + * such as systemd or git grep. For brevity, hereafter we call just 'large' + * these bursts. BFQ *does not* weight-raise queues whose activations occur + * in a large burst. In addition, for each of these queues BFQ performs or + * does not perform idling depending on which choice boosts the throughput + * most. The exact choice depends on the device and request pattern at + * hand. + * + * Turning back to the next function, it implements all the steps needed + * to detect the occurrence of a large burst and to properly mark all the + * queues belonging to it (so that they can then be treated in a different + * way). This goal is achieved by maintaining a special "burst list" that + * holds, temporarily, the queues that belong to the burst in progress. The + * list is then used to mark these queues as belonging to a large burst if + * the burst does become large. The main steps are the following. + * + * . when the very first queue is activated, the queue is inserted into the + * list (as it could be the first queue in a possible burst) + * + * . if the current burst has not yet become large, and a queue Q that does + * not yet belong to the burst is activated shortly after the last time + * at which a new queue entered the burst list, then the function appends + * Q to the burst list + * + * . if, as a consequence of the previous step, the burst size reaches + * the large-burst threshold, then + * + * . all the queues in the burst list are marked as belonging to a + * large burst + * + * . the burst list is deleted; in fact, the burst list already served + * its purpose (keeping temporarily track of the queues in a burst, + * so as to be able to mark them as belonging to a large burst in the + * previous sub-step), and now is not needed any more + * + * . the device enters a large-burst mode + * + * . if a queue Q that does not belong to the burst is activated while + * the device is in large-burst mode and shortly after the last time + * at which a queue either entered the burst list or was marked as + * belonging to the current large burst, then Q is immediately marked + * as belonging to a large burst. + * + * . if a queue Q that does not belong to the burst is activated a while + * later, i.e., not shortly after, than the last time at which a queue + * either entered the burst list or was marked as belonging to the + * current large burst, then the current burst is deemed as finished and: + * + * . the large-burst mode is reset if set + * + * . the burst list is emptied + * + * . Q is inserted in the burst list, as Q may be the first queue + * in a possible new burst (then the burst list contains just Q + * after this step). + */ +static void bfq_handle_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq, + bool idle_for_long_time) +{ + /* + * If bfqq happened to be activated in a burst, but has been idle + * for at least as long as an interactive queue, then we assume + * that, in the overall I/O initiated in the burst, the I/O + * associated to bfqq is finished. So bfqq does not need to be + * treated as a queue belonging to a burst anymore. Accordingly, + * we reset bfqq's in_large_burst flag if set, and remove bfqq + * from the burst list if it's there. We do not decrement instead + * burst_size, because the fact that bfqq does not need to belong + * to the burst list any more does not invalidate the fact that + * bfqq may have been activated during the current burst. + */ + if (idle_for_long_time) { + hlist_del_init(&bfqq->burst_list_node); + bfq_clear_bfqq_in_large_burst(bfqq); + } + + /* + * If bfqq is already in the burst list or is part of a large + * burst, then there is nothing else to do. + */ + if (!hlist_unhashed(&bfqq->burst_list_node) || + bfq_bfqq_in_large_burst(bfqq)) + return; + + /* + * If bfqq's activation happens late enough, then the current + * burst is finished, and related data structures must be reset. + * + * In this respect, consider the special case where bfqq is the very + * first queue being activated. In this case, last_ins_in_burst is + * not yet significant when we get here. But it is easy to verify + * that, whether or not the following condition is true, bfqq will + * end up being inserted into the burst list. In particular the + * list will happen to contain only bfqq. And this is exactly what + * has to happen, as bfqq may be the first queue in a possible + * burst. + */ + if (time_is_before_jiffies(bfqd->last_ins_in_burst + + bfqd->bfq_burst_interval)) { + bfqd->large_burst = false; + bfq_reset_burst_list(bfqd, bfqq); + return; + } + + /* + * If we get here, then bfqq is being activated shortly after the + * last queue. So, if the current burst is also large, we can mark + * bfqq as belonging to this large burst immediately. + */ + if (bfqd->large_burst) { + bfq_mark_bfqq_in_large_burst(bfqq); + return; + } + + /* + * If we get here, then a large-burst state has not yet been + * reached, but bfqq is being activated shortly after the last + * queue. Then we add bfqq to the burst. + */ + bfq_add_to_burst(bfqd, bfqq); +} + +static void bfq_add_request(struct request *rq) +{ + struct bfq_queue *bfqq = RQ_BFQQ(rq); + struct bfq_entity *entity = &bfqq->entity; + struct bfq_data *bfqd = bfqq->bfqd; + struct request *next_rq, *prev; + unsigned long old_wr_coeff = bfqq->wr_coeff; + bool interactive = false; + + bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq)); + bfqq->queued[rq_is_sync(rq)]++; + bfqd->queued++; + + elv_rb_add(&bfqq->sort_list, rq); + + /* + * Check if this request is a better next-serve candidate. + */ + prev = bfqq->next_rq; + next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position); + BUG_ON(!next_rq); + bfqq->next_rq = next_rq; + + /* + * Adjust priority tree position, if next_rq changes. + */ + if (prev != bfqq->next_rq) + bfq_pos_tree_add_move(bfqd, bfqq); + + if (!bfq_bfqq_busy(bfqq)) { + bool soft_rt, coop_or_in_burst, + idle_for_long_time = time_is_before_jiffies( + bfqq->budget_timeout + + bfqd->bfq_wr_min_idle_time); + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + bfqg_stats_update_io_add(bfqq_group(RQ_BFQQ(rq)), bfqq, + rq->cmd_flags); +#endif + if (bfq_bfqq_sync(bfqq)) { + bool already_in_burst = + !hlist_unhashed(&bfqq->burst_list_node) || + bfq_bfqq_in_large_burst(bfqq); + bfq_handle_burst(bfqd, bfqq, idle_for_long_time); + /* + * If bfqq was not already in the current burst, + * then, at this point, bfqq either has been + * added to the current burst or has caused the + * current burst to terminate. In particular, in + * the second case, bfqq has become the first + * queue in a possible new burst. + * In both cases last_ins_in_burst needs to be + * moved forward. + */ + if (!already_in_burst) + bfqd->last_ins_in_burst = jiffies; + } + + coop_or_in_burst = bfq_bfqq_in_large_burst(bfqq) || + bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh; + soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 && + !coop_or_in_burst && + time_is_before_jiffies(bfqq->soft_rt_next_start); + interactive = !coop_or_in_burst && idle_for_long_time; + entity->budget = max_t(unsigned long, bfqq->max_budget, + bfq_serv_to_charge(next_rq, bfqq)); + + if (!bfq_bfqq_IO_bound(bfqq)) { + if (time_before(jiffies, + RQ_BIC(rq)->ttime.last_end_request + + bfqd->bfq_slice_idle)) { + bfqq->requests_within_timer++; + if (bfqq->requests_within_timer >= + bfqd->bfq_requests_within_timer) + bfq_mark_bfqq_IO_bound(bfqq); + } else + bfqq->requests_within_timer = 0; + } + + if (!bfqd->low_latency) + goto add_bfqq_busy; + + if (bfq_bfqq_just_split(bfqq)) + goto set_prio_changed; + + /* + * If the queue: + * - is not being boosted, + * - has been idle for enough time, + * - is not a sync queue or is linked to a bfq_io_cq (it is + * shared "for its nature" or it is not shared and its + * requests have not been redirected to a shared queue) + * start a weight-raising period. + */ + if (old_wr_coeff == 1 && (interactive || soft_rt) && + (!bfq_bfqq_sync(bfqq) || bfqq->bic)) { + bfqq->wr_coeff = bfqd->bfq_wr_coeff; + if (interactive) + bfqq->wr_cur_max_time = bfq_wr_duration(bfqd); + else + bfqq->wr_cur_max_time = + bfqd->bfq_wr_rt_max_time; + bfq_log_bfqq(bfqd, bfqq, + "wrais starting at %lu, rais_max_time %u", + jiffies, + jiffies_to_msecs(bfqq->wr_cur_max_time)); + } else if (old_wr_coeff > 1) { + if (interactive) + bfqq->wr_cur_max_time = bfq_wr_duration(bfqd); + else if (coop_or_in_burst || + (bfqq->wr_cur_max_time == + bfqd->bfq_wr_rt_max_time && + !soft_rt)) { + bfqq->wr_coeff = 1; + bfq_log_bfqq(bfqd, bfqq, + "wrais ending at %lu, rais_max_time %u", + jiffies, + jiffies_to_msecs(bfqq-> + wr_cur_max_time)); + } else if (time_before( + bfqq->last_wr_start_finish + + bfqq->wr_cur_max_time, + jiffies + + bfqd->bfq_wr_rt_max_time) && + soft_rt) { + /* + * + * The remaining weight-raising time is lower + * than bfqd->bfq_wr_rt_max_time, which means + * that the application is enjoying weight + * raising either because deemed soft-rt in + * the near past, or because deemed interactive + * a long ago. + * In both cases, resetting now the current + * remaining weight-raising time for the + * application to the weight-raising duration + * for soft rt applications would not cause any + * latency increase for the application (as the + * new duration would be higher than the + * remaining time). + * + * In addition, the application is now meeting + * the requirements for being deemed soft rt. + * In the end we can correctly and safely + * (re)charge the weight-raising duration for + * the application with the weight-raising + * duration for soft rt applications. + * + * In particular, doing this recharge now, i.e., + * before the weight-raising period for the + * application finishes, reduces the probability + * of the following negative scenario: + * 1) the weight of a soft rt application is + * raised at startup (as for any newly + * created application), + * 2) since the application is not interactive, + * at a certain time weight-raising is + * stopped for the application, + * 3) at that time the application happens to + * still have pending requests, and hence + * is destined to not have a chance to be + * deemed soft rt before these requests are + * completed (see the comments to the + * function bfq_bfqq_softrt_next_start() + * for details on soft rt detection), + * 4) these pending requests experience a high + * latency because the application is not + * weight-raised while they are pending. + */ + bfqq->last_wr_start_finish = jiffies; + bfqq->wr_cur_max_time = + bfqd->bfq_wr_rt_max_time; + } + } +set_prio_changed: + if (old_wr_coeff != bfqq->wr_coeff) + entity->prio_changed = 1; +add_bfqq_busy: + bfqq->last_idle_bklogged = jiffies; + bfqq->service_from_backlogged = 0; + bfq_clear_bfqq_softrt_update(bfqq); + bfq_add_bfqq_busy(bfqd, bfqq); + } else { + if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) && + time_is_before_jiffies( + bfqq->last_wr_start_finish + + bfqd->bfq_wr_min_inter_arr_async)) { + bfqq->wr_coeff = bfqd->bfq_wr_coeff; + bfqq->wr_cur_max_time = bfq_wr_duration(bfqd); + + bfqd->wr_busy_queues++; + entity->prio_changed = 1; + bfq_log_bfqq(bfqd, bfqq, + "non-idle wrais starting at %lu, rais_max_time %u", + jiffies, + jiffies_to_msecs(bfqq->wr_cur_max_time)); + } + if (prev != bfqq->next_rq) + bfq_updated_next_req(bfqd, bfqq); + } + + if (bfqd->low_latency && + (old_wr_coeff == 1 || bfqq->wr_coeff == 1 || interactive)) + bfqq->last_wr_start_finish = jiffies; +} + +static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd, + struct bio *bio) +{ + struct task_struct *tsk = current; + struct bfq_io_cq *bic; + struct bfq_queue *bfqq; + + bic = bfq_bic_lookup(bfqd, tsk->io_context); + if (!bic) + return NULL; + + bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio)); + if (bfqq) + return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio)); + + return NULL; +} + +static void bfq_activate_request(struct request_queue *q, struct request *rq) +{ + struct bfq_data *bfqd = q->elevator->elevator_data; + + bfqd->rq_in_driver++; + bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq); + bfq_log(bfqd, "activate_request: new bfqd->last_position %llu", + (long long unsigned)bfqd->last_position); +} + +static void bfq_deactivate_request(struct request_queue *q, struct request *rq) +{ + struct bfq_data *bfqd = q->elevator->elevator_data; + + BUG_ON(bfqd->rq_in_driver == 0); + bfqd->rq_in_driver--; +} + +static void bfq_remove_request(struct request *rq) +{ + struct bfq_queue *bfqq = RQ_BFQQ(rq); + struct bfq_data *bfqd = bfqq->bfqd; + const int sync = rq_is_sync(rq); + + if (bfqq->next_rq == rq) { + bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq); + bfq_updated_next_req(bfqd, bfqq); + } + + if (rq->queuelist.prev != &rq->queuelist) + list_del_init(&rq->queuelist); + BUG_ON(bfqq->queued[sync] == 0); + bfqq->queued[sync]--; + bfqd->queued--; + elv_rb_del(&bfqq->sort_list, rq); + + if (RB_EMPTY_ROOT(&bfqq->sort_list)) { + if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue) + bfq_del_bfqq_busy(bfqd, bfqq, 1); + /* + * Remove queue from request-position tree as it is empty. + */ + if (bfqq->pos_root) { + rb_erase(&bfqq->pos_node, bfqq->pos_root); + bfqq->pos_root = NULL; + } + } + + if (rq->cmd_flags & REQ_META) { + BUG_ON(bfqq->meta_pending == 0); + bfqq->meta_pending--; + } +#ifdef CONFIG_BFQ_GROUP_IOSCHED + bfqg_stats_update_io_remove(bfqq_group(bfqq), rq->cmd_flags); +#endif +} + +static int bfq_merge(struct request_queue *q, struct request **req, + struct bio *bio) +{ + struct bfq_data *bfqd = q->elevator->elevator_data; + struct request *__rq; + + __rq = bfq_find_rq_fmerge(bfqd, bio); + if (__rq && elv_rq_merge_ok(__rq, bio)) { + *req = __rq; + return ELEVATOR_FRONT_MERGE; + } + + return ELEVATOR_NO_MERGE; +} + +static void bfq_merged_request(struct request_queue *q, struct request *req, + int type) +{ + if (type == ELEVATOR_FRONT_MERGE && + rb_prev(&req->rb_node) && + blk_rq_pos(req) < + blk_rq_pos(container_of(rb_prev(&req->rb_node), + struct request, rb_node))) { + struct bfq_queue *bfqq = RQ_BFQQ(req); + struct bfq_data *bfqd = bfqq->bfqd; + struct request *prev, *next_rq; + + /* Reposition request in its sort_list */ + elv_rb_del(&bfqq->sort_list, req); + elv_rb_add(&bfqq->sort_list, req); + /* Choose next request to be served for bfqq */ + prev = bfqq->next_rq; + next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req, + bfqd->last_position); + BUG_ON(!next_rq); + bfqq->next_rq = next_rq; + /* + * If next_rq changes, update both the queue's budget to + * fit the new request and the queue's position in its + * rq_pos_tree. + */ + if (prev != bfqq->next_rq) { + bfq_updated_next_req(bfqd, bfqq); + bfq_pos_tree_add_move(bfqd, bfqq); + } + } +} + +#ifdef CONFIG_BFQ_GROUP_IOSCHED +static void bfq_bio_merged(struct request_queue *q, struct request *req, + struct bio *bio) +{ + bfqg_stats_update_io_merged(bfqq_group(RQ_BFQQ(req)), bio->bi_rw); +} +#endif + +static void bfq_merged_requests(struct request_queue *q, struct request *rq, + struct request *next) +{ + struct bfq_queue *bfqq = RQ_BFQQ(rq), *next_bfqq = RQ_BFQQ(next); + + /* + * If next and rq belong to the same bfq_queue and next is older + * than rq, then reposition rq in the fifo (by substituting next + * with rq). Otherwise, if next and rq belong to different + * bfq_queues, never reposition rq: in fact, we would have to + * reposition it with respect to next's position in its own fifo, + * which would most certainly be too expensive with respect to + * the benefits. + */ + if (bfqq == next_bfqq && + !list_empty(&rq->queuelist) && !list_empty(&next->queuelist) && + time_before(next->fifo_time, rq->fifo_time)) { + list_del_init(&rq->queuelist); + list_replace_init(&next->queuelist, &rq->queuelist); + rq->fifo_time = next->fifo_time; + } + + if (bfqq->next_rq == next) + bfqq->next_rq = rq; + + bfq_remove_request(next); +#ifdef CONFIG_BFQ_GROUP_IOSCHED + bfqg_stats_update_io_merged(bfqq_group(bfqq), next->cmd_flags); +#endif +} + +/* Must be called with bfqq != NULL */ +static void bfq_bfqq_end_wr(struct bfq_queue *bfqq) +{ + BUG_ON(!bfqq); + if (bfq_bfqq_busy(bfqq)) + bfqq->bfqd->wr_busy_queues--; + bfqq->wr_coeff = 1; + bfqq->wr_cur_max_time = 0; + /* Trigger a weight change on the next activation of the queue */ + bfqq->entity.prio_changed = 1; +} + +static void bfq_end_wr_async_queues(struct bfq_data *bfqd, + struct bfq_group *bfqg) +{ + int i, j; + + for (i = 0; i < 2; i++) + for (j = 0; j < IOPRIO_BE_NR; j++) + if (bfqg->async_bfqq[i][j]) + bfq_bfqq_end_wr(bfqg->async_bfqq[i][j]); + if (bfqg->async_idle_bfqq) + bfq_bfqq_end_wr(bfqg->async_idle_bfqq); +} + +static void bfq_end_wr(struct bfq_data *bfqd) +{ + struct bfq_queue *bfqq; + + spin_lock_irq(bfqd->queue->queue_lock); + + list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) + bfq_bfqq_end_wr(bfqq); + list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) + bfq_bfqq_end_wr(bfqq); + bfq_end_wr_async(bfqd); + + spin_unlock_irq(bfqd->queue->queue_lock); +} + +static sector_t bfq_io_struct_pos(void *io_struct, bool request) +{ + if (request) + return blk_rq_pos(io_struct); + else + return ((struct bio *)io_struct)->bi_iter.bi_sector; +} + +static int bfq_rq_close_to_sector(void *io_struct, bool request, + sector_t sector) +{ + return abs(bfq_io_struct_pos(io_struct, request) - sector) <= + BFQQ_SEEK_THR; +} + +static struct bfq_queue *bfqq_find_close(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + sector_t sector) +{ + struct rb_root *root = &bfq_bfqq_to_bfqg(bfqq)->rq_pos_tree; + struct rb_node *parent, *node; + struct bfq_queue *__bfqq; + + if (RB_EMPTY_ROOT(root)) + return NULL; + + /* + * First, if we find a request starting at the end of the last + * request, choose it. + */ + __bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL); + if (__bfqq) + return __bfqq; + + /* + * If the exact sector wasn't found, the parent of the NULL leaf + * will contain the closest sector (rq_pos_tree sorted by + * next_request position). + */ + __bfqq = rb_entry(parent, struct bfq_queue, pos_node); + if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector)) + return __bfqq; + + if (blk_rq_pos(__bfqq->next_rq) < sector) + node = rb_next(&__bfqq->pos_node); + else + node = rb_prev(&__bfqq->pos_node); + if (!node) + return NULL; + + __bfqq = rb_entry(node, struct bfq_queue, pos_node); + if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector)) + return __bfqq; + + return NULL; +} + +static struct bfq_queue *bfq_find_close_cooperator(struct bfq_data *bfqd, + struct bfq_queue *cur_bfqq, + sector_t sector) +{ + struct bfq_queue *bfqq; + + /* + * We shall notice if some of the queues are cooperating, + * e.g., working closely on the same area of the device. In + * that case, we can group them together and: 1) don't waste + * time idling, and 2) serve the union of their requests in + * the best possible order for throughput. + */ + bfqq = bfqq_find_close(bfqd, cur_bfqq, sector); + if (!bfqq || bfqq == cur_bfqq) + return NULL; + + return bfqq; +} + +static struct bfq_queue * +bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq) +{ + int process_refs, new_process_refs; + struct bfq_queue *__bfqq; + + /* + * If there are no process references on the new_bfqq, then it is + * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain + * may have dropped their last reference (not just their last process + * reference). + */ + if (!bfqq_process_refs(new_bfqq)) + return NULL; + + /* Avoid a circular list and skip interim queue merges. */ + while ((__bfqq = new_bfqq->new_bfqq)) { + if (__bfqq == bfqq) + return NULL; + new_bfqq = __bfqq; + } + + process_refs = bfqq_process_refs(bfqq); + new_process_refs = bfqq_process_refs(new_bfqq); + /* + * If the process for the bfqq has gone away, there is no + * sense in merging the queues. + */ + if (process_refs == 0 || new_process_refs == 0) + return NULL; + + bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d", + new_bfqq->pid); + + /* + * Merging is just a redirection: the requests of the process + * owning one of the two queues are redirected to the other queue. + * The latter queue, in its turn, is set as shared if this is the + * first time that the requests of some process are redirected to + * it. + * + * We redirect bfqq to new_bfqq and not the opposite, because we + * are in the context of the process owning bfqq, hence we have + * the io_cq of this process. So we can immediately configure this + * io_cq to redirect the requests of the process to new_bfqq. + * + * NOTE, even if new_bfqq coincides with the in-service queue, the + * io_cq of new_bfqq is not available, because, if the in-service + * queue is shared, bfqd->in_service_bic may not point to the + * io_cq of the in-service queue. + * Redirecting the requests of the process owning bfqq to the + * currently in-service queue is in any case the best option, as + * we feed the in-service queue with new requests close to the + * last request served and, by doing so, hopefully increase the + * throughput. + */ + bfqq->new_bfqq = new_bfqq; + atomic_add(process_refs, &new_bfqq->ref); + return new_bfqq; +} + +static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq, + struct bfq_queue *new_bfqq) +{ + if (bfq_class_idle(bfqq) || bfq_class_idle(new_bfqq) || + (bfqq->ioprio_class != new_bfqq->ioprio_class)) + return false; + + /* + * If either of the queues has already been detected as seeky, + * then merging it with the other queue is unlikely to lead to + * sequential I/O. + */ + if (BFQQ_SEEKY(bfqq) || BFQQ_SEEKY(new_bfqq)) + return false; + + /* + * Interleaved I/O is known to be done by (some) applications + * only for reads, so it does not make sense to merge async + * queues. + */ + if (!bfq_bfqq_sync(bfqq) || !bfq_bfqq_sync(new_bfqq)) + return false; + + return true; +} + +/* + * Attempt to schedule a merge of bfqq with the currently in-service queue + * or with a close queue among the scheduled queues. + * Return NULL if no merge was scheduled, a pointer to the shared bfq_queue + * structure otherwise. + * + * The OOM queue is not allowed to participate to cooperation: in fact, since + * the requests temporarily redirected to the OOM queue could be redirected + * again to dedicated queues at any time, the state needed to correctly + * handle merging with the OOM queue would be quite complex and expensive + * to maintain. Besides, in such a critical condition as an out of memory, + * the benefits of queue merging may be little relevant, or even negligible. + */ +static struct bfq_queue * +bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq, + void *io_struct, bool request) +{ + struct bfq_queue *in_service_bfqq, *new_bfqq; + + if (bfqq->new_bfqq) + return bfqq->new_bfqq; + if (!io_struct || unlikely(bfqq == &bfqd->oom_bfqq)) + return NULL; + /* If device has only one backlogged bfq_queue, don't search. */ + if (bfqd->busy_queues == 1) + return NULL; + + in_service_bfqq = bfqd->in_service_queue; + + if (!in_service_bfqq || in_service_bfqq == bfqq || + !bfqd->in_service_bic || + unlikely(in_service_bfqq == &bfqd->oom_bfqq)) + goto check_scheduled; + + if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) && + bfqq->entity.parent == in_service_bfqq->entity.parent && + bfq_may_be_close_cooperator(bfqq, in_service_bfqq)) { + new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq); + if (new_bfqq) + return new_bfqq; + } + /* + * Check whether there is a cooperator among currently scheduled + * queues. The only thing we need is that the bio/request is not + * NULL, as we need it to establish whether a cooperator exists. + */ +check_scheduled: + new_bfqq = bfq_find_close_cooperator(bfqd, bfqq, + bfq_io_struct_pos(io_struct, request)); + + BUG_ON(new_bfqq && bfqq->entity.parent != new_bfqq->entity.parent); + + if (new_bfqq && likely(new_bfqq != &bfqd->oom_bfqq) && + bfq_may_be_close_cooperator(bfqq, new_bfqq)) + return bfq_setup_merge(bfqq, new_bfqq); + + return NULL; +} + +static void bfq_bfqq_save_state(struct bfq_queue *bfqq) +{ + /* + * If !bfqq->bic, the queue is already shared or its requests + * have already been redirected to a shared queue; both idle window + * and weight raising state have already been saved. Do nothing. + */ + if (!bfqq->bic) + return; + if (bfqq->bic->wr_time_left) + /* + * This is the queue of a just-started process, and would + * deserve weight raising: we set wr_time_left to the full + * weight-raising duration to trigger weight-raising when + * and if the queue is split and the first request of the + * queue is enqueued. + */ + bfqq->bic->wr_time_left = bfq_wr_duration(bfqq->bfqd); + else if (bfqq->wr_coeff > 1) { + unsigned long wr_duration = + jiffies - bfqq->last_wr_start_finish; + /* + * It may happen that a queue's weight raising period lasts + * longer than its wr_cur_max_time, as weight raising is + * handled only when a request is enqueued or dispatched (it + * does not use any timer). If the weight raising period is + * about to end, don't save it. + */ + if (bfqq->wr_cur_max_time <= wr_duration) + bfqq->bic->wr_time_left = 0; + else + bfqq->bic->wr_time_left = + bfqq->wr_cur_max_time - wr_duration; + /* + * The bfq_queue is becoming shared or the requests of the + * process owning the queue are being redirected to a shared + * queue. Stop the weight raising period of the queue, as in + * both cases it should not be owned by an interactive or + * soft real-time application. + */ + bfq_bfqq_end_wr(bfqq); + } else + bfqq->bic->wr_time_left = 0; + bfqq->bic->saved_idle_window = bfq_bfqq_idle_window(bfqq); + bfqq->bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq); + bfqq->bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq); + bfqq->bic->was_in_burst_list = !hlist_unhashed(&bfqq->burst_list_node); + bfqq->bic->cooperations++; + bfqq->bic->failed_cooperations = 0; +} + +static void bfq_get_bic_reference(struct bfq_queue *bfqq) +{ + /* + * If bfqq->bic has a non-NULL value, the bic to which it belongs + * is about to begin using a shared bfq_queue. + */ + if (bfqq->bic) + atomic_long_inc(&bfqq->bic->icq.ioc->refcount); +} + +static void +bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic, + struct bfq_queue *bfqq, struct bfq_queue *new_bfqq) +{ + bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu", + (long unsigned)new_bfqq->pid); + /* Save weight raising and idle window of the merged queues */ + bfq_bfqq_save_state(bfqq); + bfq_bfqq_save_state(new_bfqq); + if (bfq_bfqq_IO_bound(bfqq)) + bfq_mark_bfqq_IO_bound(new_bfqq); + bfq_clear_bfqq_IO_bound(bfqq); + /* + * Grab a reference to the bic, to prevent it from being destroyed + * before being possibly touched by a bfq_split_bfqq(). + */ + bfq_get_bic_reference(bfqq); + bfq_get_bic_reference(new_bfqq); + /* + * Merge queues (that is, let bic redirect its requests to new_bfqq) + */ + bic_set_bfqq(bic, new_bfqq, 1); + bfq_mark_bfqq_coop(new_bfqq); + /* + * new_bfqq now belongs to at least two bics (it is a shared queue): + * set new_bfqq->bic to NULL. bfqq either: + * - does not belong to any bic any more, and hence bfqq->bic must + * be set to NULL, or + * - is a queue whose owning bics have already been redirected to a + * different queue, hence the queue is destined to not belong to + * any bic soon and bfqq->bic is already NULL (therefore the next + * assignment causes no harm). + */ + new_bfqq->bic = NULL; + bfqq->bic = NULL; + bfq_put_queue(bfqq); +} + +static void bfq_bfqq_increase_failed_cooperations(struct bfq_queue *bfqq) +{ + struct bfq_io_cq *bic = bfqq->bic; + struct bfq_data *bfqd = bfqq->bfqd; + + if (bic && bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh) { + bic->failed_cooperations++; + if (bic->failed_cooperations >= bfqd->bfq_failed_cooperations) + bic->cooperations = 0; + } +} + +static int bfq_allow_merge(struct request_queue *q, struct request *rq, + struct bio *bio) +{ + struct bfq_data *bfqd = q->elevator->elevator_data; + struct bfq_io_cq *bic; + struct bfq_queue *bfqq, *new_bfqq; + + /* + * Disallow merge of a sync bio into an async request. + */ + if (bfq_bio_sync(bio) && !rq_is_sync(rq)) + return 0; + + /* + * Lookup the bfqq that this bio will be queued with. Allow + * merge only if rq is queued there. + * Queue lock is held here. + */ + bic = bfq_bic_lookup(bfqd, current->io_context); + if (!bic) + return 0; + + bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio)); + /* + * We take advantage of this function to perform an early merge + * of the queues of possible cooperating processes. + */ + if (bfqq) { + new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false); + if (new_bfqq) { + bfq_merge_bfqqs(bfqd, bic, bfqq, new_bfqq); + /* + * If we get here, the bio will be queued in the + * shared queue, i.e., new_bfqq, so use new_bfqq + * to decide whether bio and rq can be merged. + */ + bfqq = new_bfqq; + } else + bfq_bfqq_increase_failed_cooperations(bfqq); + } + + return bfqq == RQ_BFQQ(rq); +} + +static void __bfq_set_in_service_queue(struct bfq_data *bfqd, + struct bfq_queue *bfqq) +{ + if (bfqq) { +#ifdef CONFIG_BFQ_GROUP_IOSCHED + bfqg_stats_update_avg_queue_size(bfqq_group(bfqq)); +#endif + bfq_mark_bfqq_must_alloc(bfqq); + bfq_mark_bfqq_budget_new(bfqq); + bfq_clear_bfqq_fifo_expire(bfqq); + + bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8; + + bfq_log_bfqq(bfqd, bfqq, + "set_in_service_queue, cur-budget = %d", + bfqq->entity.budget); + } + + bfqd->in_service_queue = bfqq; +} + +/* + * Get and set a new queue for service. + */ +static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd) +{ + struct bfq_queue *bfqq = bfq_get_next_queue(bfqd); + + __bfq_set_in_service_queue(bfqd, bfqq); + return bfqq; +} + +/* + * If enough samples have been computed, return the current max budget + * stored in bfqd, which is dynamically updated according to the + * estimated disk peak rate; otherwise return the default max budget + */ +static int bfq_max_budget(struct bfq_data *bfqd) +{ + if (bfqd->budgets_assigned < bfq_stats_min_budgets) + return bfq_default_max_budget; + else + return bfqd->bfq_max_budget; +} + +/* + * Return min budget, which is a fraction of the current or default + * max budget (trying with 1/32) + */ +static int bfq_min_budget(struct bfq_data *bfqd) +{ + if (bfqd->budgets_assigned < bfq_stats_min_budgets) + return bfq_default_max_budget / 32; + else + return bfqd->bfq_max_budget / 32; +} + +static void bfq_arm_slice_timer(struct bfq_data *bfqd) +{ + struct bfq_queue *bfqq = bfqd->in_service_queue; + struct bfq_io_cq *bic; + unsigned long sl; + + BUG_ON(!RB_EMPTY_ROOT(&bfqq->sort_list)); + + /* Processes have exited, don't wait. */ + bic = bfqd->in_service_bic; + if (!bic || atomic_read(&bic->icq.ioc->active_ref) == 0) + return; + + bfq_mark_bfqq_wait_request(bfqq); + + /* + * We don't want to idle for seeks, but we do want to allow + * fair distribution of slice time for a process doing back-to-back + * seeks. So allow a little bit of time for him to submit a new rq. + * + * To prevent processes with (partly) seeky workloads from + * being too ill-treated, grant them a small fraction of the + * assigned budget before reducing the waiting time to + * BFQ_MIN_TT. This happened to help reduce latency. + */ + sl = bfqd->bfq_slice_idle; + /* + * Unless the queue is being weight-raised or the scenario is + * asymmetric, grant only minimum idle time if the queue either + * has been seeky for long enough or has already proved to be + * constantly seeky. + */ + if (bfq_sample_valid(bfqq->seek_samples) && + ((BFQQ_SEEKY(bfqq) && bfqq->entity.service > + bfq_max_budget(bfqq->bfqd) / 8) || + bfq_bfqq_constantly_seeky(bfqq)) && bfqq->wr_coeff == 1 && + bfq_symmetric_scenario(bfqd)) + sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT)); + else if (bfqq->wr_coeff > 1) + sl = sl * 3; + bfqd->last_idling_start = ktime_get(); + mod_timer(&bfqd->idle_slice_timer, jiffies + sl); +#ifdef CONFIG_BFQ_GROUP_IOSCHED + bfqg_stats_set_start_idle_time(bfqq_group(bfqq)); +#endif + bfq_log(bfqd, "arm idle: %u/%u ms", + jiffies_to_msecs(sl), jiffies_to_msecs(bfqd->bfq_slice_idle)); +} + +/* + * Set the maximum time for the in-service queue to consume its + * budget. This prevents seeky processes from lowering the disk + * throughput (always guaranteed with a time slice scheme as in CFQ). + */ +static void bfq_set_budget_timeout(struct bfq_data *bfqd) +{ + struct bfq_queue *bfqq = bfqd->in_service_queue; + unsigned int timeout_coeff; + if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time) + timeout_coeff = 1; + else + timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight; + + bfqd->last_budget_start = ktime_get(); + + bfq_clear_bfqq_budget_new(bfqq); + bfqq->budget_timeout = jiffies + + bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] * timeout_coeff; + + bfq_log_bfqq(bfqd, bfqq, "set budget_timeout %u", + jiffies_to_msecs(bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] * + timeout_coeff)); +} + +/* + * Move request from internal lists to the request queue dispatch list. + */ +static void bfq_dispatch_insert(struct request_queue *q, struct request *rq) +{ + struct bfq_data *bfqd = q->elevator->elevator_data; + struct bfq_queue *bfqq = RQ_BFQQ(rq); + + /* + * For consistency, the next instruction should have been executed + * after removing the request from the queue and dispatching it. + * We execute instead this instruction before bfq_remove_request() + * (and hence introduce a temporary inconsistency), for efficiency. + * In fact, in a forced_dispatch, this prevents two counters related + * to bfqq->dispatched to risk to be uselessly decremented if bfqq + * is not in service, and then to be incremented again after + * incrementing bfqq->dispatched. + */ + bfqq->dispatched++; + bfq_remove_request(rq); + elv_dispatch_sort(q, rq); + + if (bfq_bfqq_sync(bfqq)) + bfqd->sync_flight++; +#ifdef CONFIG_BFQ_GROUP_IOSCHED + bfqg_stats_update_dispatch(bfqq_group(bfqq), blk_rq_bytes(rq), + rq->cmd_flags); +#endif +} + +/* + * Return expired entry, or NULL to just start from scratch in rbtree. + */ +static struct request *bfq_check_fifo(struct bfq_queue *bfqq) +{ + struct request *rq = NULL; + + if (bfq_bfqq_fifo_expire(bfqq)) + return NULL; + + bfq_mark_bfqq_fifo_expire(bfqq); + + if (list_empty(&bfqq->fifo)) + return NULL; + + rq = rq_entry_fifo(bfqq->fifo.next); + + if (time_before(jiffies, rq->fifo_time)) + return NULL; + + return rq; +} + +static int bfq_bfqq_budget_left(struct bfq_queue *bfqq) +{ + struct bfq_entity *entity = &bfqq->entity; + return entity->budget - entity->service; +} + +static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + BUG_ON(bfqq != bfqd->in_service_queue); + + __bfq_bfqd_reset_in_service(bfqd); + + /* + * If this bfqq is shared between multiple processes, check + * to make sure that those processes are still issuing I/Os + * within the mean seek distance. If not, it may be time to + * break the queues apart again. + */ + if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq)) + bfq_mark_bfqq_split_coop(bfqq); + + if (RB_EMPTY_ROOT(&bfqq->sort_list)) { + /* + * Overloading budget_timeout field to store the time + * at which the queue remains with no backlog; used by + * the weight-raising mechanism. + */ + bfqq->budget_timeout = jiffies; + bfq_del_bfqq_busy(bfqd, bfqq, 1); + } else { + bfq_activate_bfqq(bfqd, bfqq); + /* + * Resort priority tree of potential close cooperators. + */ + bfq_pos_tree_add_move(bfqd, bfqq); + } +} + +/** + * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior. + * @bfqd: device data. + * @bfqq: queue to update. + * @reason: reason for expiration. + * + * Handle the feedback on @bfqq budget at queue expiration. + * See the body for detailed comments. + */ +static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + enum bfqq_expiration reason) +{ + struct request *next_rq; + int budget, min_budget; + + budget = bfqq->max_budget; + min_budget = bfq_min_budget(bfqd); + + BUG_ON(bfqq != bfqd->in_service_queue); + + bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %d, budg left %d", + bfqq->entity.budget, bfq_bfqq_budget_left(bfqq)); + bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %d, min budg %d", + budget, bfq_min_budget(bfqd)); + bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d", + bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue)); + + if (bfq_bfqq_sync(bfqq)) { + switch (reason) { + /* + * Caveat: in all the following cases we trade latency + * for throughput. + */ + case BFQ_BFQQ_TOO_IDLE: + /* + * This is the only case where we may reduce + * the budget: if there is no request of the + * process still waiting for completion, then + * we assume (tentatively) that the timer has + * expired because the batch of requests of + * the process could have been served with a + * smaller budget. Hence, betting that + * process will behave in the same way when it + * becomes backlogged again, we reduce its + * next budget. As long as we guess right, + * this budget cut reduces the latency + * experienced by the process. + * + * However, if there are still outstanding + * requests, then the process may have not yet + * issued its next request just because it is + * still waiting for the completion of some of + * the still outstanding ones. So in this + * subcase we do not reduce its budget, on the + * contrary we increase it to possibly boost + * the throughput, as discussed in the + * comments to the BUDGET_TIMEOUT case. + */ + if (bfqq->dispatched > 0) /* still outstanding reqs */ + budget = min(budget * 2, bfqd->bfq_max_budget); + else { + if (budget > 5 * min_budget) + budget -= 4 * min_budget; + else + budget = min_budget; + } + break; + case BFQ_BFQQ_BUDGET_TIMEOUT: + /* + * We double the budget here because: 1) it + * gives the chance to boost the throughput if + * this is not a seeky process (which may have + * bumped into this timeout because of, e.g., + * ZBR), 2) together with charge_full_budget + * it helps give seeky processes higher + * timestamps, and hence be served less + * frequently. + */ + budget = min(budget * 2, bfqd->bfq_max_budget); + break; + case BFQ_BFQQ_BUDGET_EXHAUSTED: + /* + * The process still has backlog, and did not + * let either the budget timeout or the disk + * idling timeout expire. Hence it is not + * seeky, has a short thinktime and may be + * happy with a higher budget too. So + * definitely increase the budget of this good + * candidate to boost the disk throughput. + */ + budget = min(budget * 4, bfqd->bfq_max_budget); + break; + case BFQ_BFQQ_NO_MORE_REQUESTS: + /* + * Leave the budget unchanged. + */ + default: + return; + } + } else + /* + * Async queues get always the maximum possible budget + * (their ability to dispatch is limited by + * @bfqd->bfq_max_budget_async_rq). + */ + budget = bfqd->bfq_max_budget; + + bfqq->max_budget = budget; + + if (bfqd->budgets_assigned >= bfq_stats_min_budgets && + !bfqd->bfq_user_max_budget) + bfqq->max_budget = min(bfqq->max_budget, bfqd->bfq_max_budget); + + /* + * Make sure that we have enough budget for the next request. + * Since the finish time of the bfqq must be kept in sync with + * the budget, be sure to call __bfq_bfqq_expire() after the + * update. + */ + next_rq = bfqq->next_rq; + if (next_rq) + bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget, + bfq_serv_to_charge(next_rq, bfqq)); + else + bfqq->entity.budget = bfqq->max_budget; + + bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %d", + next_rq ? blk_rq_sectors(next_rq) : 0, + bfqq->entity.budget); +} + +static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout) +{ + unsigned long max_budget; + + /* + * The max_budget calculated when autotuning is equal to the + * amount of sectors transfered in timeout_sync at the + * estimated peak rate. + */ + max_budget = (unsigned long)(peak_rate * 1000 * + timeout >> BFQ_RATE_SHIFT); + + return max_budget; +} + +/* + * In addition to updating the peak rate, checks whether the process + * is "slow", and returns 1 if so. This slow flag is used, in addition + * to the budget timeout, to reduce the amount of service provided to + * seeky processes, and hence reduce their chances to lower the + * throughput. See the code for more details. + */ +static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq, + bool compensate, enum bfqq_expiration reason) +{ + u64 bw, usecs, expected, timeout; + ktime_t delta; + int update = 0; + + if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq)) + return false; + + if (compensate) + delta = bfqd->last_idling_start; + else + delta = ktime_get(); + delta = ktime_sub(delta, bfqd->last_budget_start); + usecs = ktime_to_us(delta); + + /* Don't trust short/unrealistic values. */ + if (usecs < 100 || usecs >= LONG_MAX) + return false; + + /* + * Calculate the bandwidth for the last slice. We use a 64 bit + * value to store the peak rate, in sectors per usec in fixed + * point math. We do so to have enough precision in the estimate + * and to avoid overflows. + */ + bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT; + do_div(bw, (unsigned long)usecs); + + timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]); + + /* + * Use only long (> 20ms) intervals to filter out spikes for + * the peak rate estimation. + */ + if (usecs > 20000) { + if (bw > bfqd->peak_rate || + (!BFQQ_SEEKY(bfqq) && + reason == BFQ_BFQQ_BUDGET_TIMEOUT)) { + bfq_log(bfqd, "measured bw =%llu", bw); + /* + * To smooth oscillations use a low-pass filter with + * alpha=7/8, i.e., + * new_rate = (7/8) * old_rate + (1/8) * bw + */ + do_div(bw, 8); + if (bw == 0) + return 0; + bfqd->peak_rate *= 7; + do_div(bfqd->peak_rate, 8); + bfqd->peak_rate += bw; + update = 1; + bfq_log(bfqd, "new peak_rate=%llu", bfqd->peak_rate); + } + + update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1; + + if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES) + bfqd->peak_rate_samples++; + + if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES && + update) { + int dev_type = blk_queue_nonrot(bfqd->queue); + if (bfqd->bfq_user_max_budget == 0) { + bfqd->bfq_max_budget = + bfq_calc_max_budget(bfqd->peak_rate, + timeout); + bfq_log(bfqd, "new max_budget=%d", + bfqd->bfq_max_budget); + } + if (bfqd->device_speed == BFQ_BFQD_FAST && + bfqd->peak_rate < device_speed_thresh[dev_type]) { + bfqd->device_speed = BFQ_BFQD_SLOW; + bfqd->RT_prod = R_slow[dev_type] * + T_slow[dev_type]; + } else if (bfqd->device_speed == BFQ_BFQD_SLOW && + bfqd->peak_rate > device_speed_thresh[dev_type]) { + bfqd->device_speed = BFQ_BFQD_FAST; + bfqd->RT_prod = R_fast[dev_type] * + T_fast[dev_type]; + } + } + } + + /* + * If the process has been served for a too short time + * interval to let its possible sequential accesses prevail on + * the initial seek time needed to move the disk head on the + * first sector it requested, then give the process a chance + * and for the moment return false. + */ + if (bfqq->entity.budget <= bfq_max_budget(bfqd) / 8) + return false; + + /* + * A process is considered ``slow'' (i.e., seeky, so that we + * cannot treat it fairly in the service domain, as it would + * slow down too much the other processes) if, when a slice + * ends for whatever reason, it has received service at a + * rate that would not be high enough to complete the budget + * before the budget timeout expiration. + */ + expected = bw * 1000 * timeout >> BFQ_RATE_SHIFT; + + /* + * Caveat: processes doing IO in the slower disk zones will + * tend to be slow(er) even if not seeky. And the estimated + * peak rate will actually be an average over the disk + * surface. Hence, to not be too harsh with unlucky processes, + * we keep a budget/3 margin of safety before declaring a + * process slow. + */ + return expected > (4 * bfqq->entity.budget) / 3; +} + +/* + * To be deemed as soft real-time, an application must meet two + * requirements. First, the application must not require an average + * bandwidth higher than the approximate bandwidth required to playback or + * record a compressed high-definition video. + * The next function is invoked on the completion of the last request of a + * batch, to compute the next-start time instant, soft_rt_next_start, such + * that, if the next request of the application does not arrive before + * soft_rt_next_start, then the above requirement on the bandwidth is met. + * + * The second requirement is that the request pattern of the application is + * isochronous, i.e., that, after issuing a request or a batch of requests, + * the application stops issuing new requests until all its pending requests + * have been completed. After that, the application may issue a new batch, + * and so on. + * For this reason the next function is invoked to compute + * soft_rt_next_start only for applications that meet this requirement, + * whereas soft_rt_next_start is set to infinity for applications that do + * not. + * + * Unfortunately, even a greedy application may happen to behave in an + * isochronous way if the CPU load is high. In fact, the application may + * stop issuing requests while the CPUs are busy serving other processes, + * then restart, then stop again for a while, and so on. In addition, if + * the disk achieves a low enough throughput with the request pattern + * issued by the application (e.g., because the request pattern is random + * and/or the device is slow), then the application may meet the above + * bandwidth requirement too. To prevent such a greedy application to be + * deemed as soft real-time, a further rule is used in the computation of + * soft_rt_next_start: soft_rt_next_start must be higher than the current + * time plus the maximum time for which the arrival of a request is waited + * for when a sync queue becomes idle, namely bfqd->bfq_slice_idle. + * This filters out greedy applications, as the latter issue instead their + * next request as soon as possible after the last one has been completed + * (in contrast, when a batch of requests is completed, a soft real-time + * application spends some time processing data). + * + * Unfortunately, the last filter may easily generate false positives if + * only bfqd->bfq_slice_idle is used as a reference time interval and one + * or both the following cases occur: + * 1) HZ is so low that the duration of a jiffy is comparable to or higher + * than bfqd->bfq_slice_idle. This happens, e.g., on slow devices with + * HZ=100. + * 2) jiffies, instead of increasing at a constant rate, may stop increasing + * for a while, then suddenly 'jump' by several units to recover the lost + * increments. This seems to happen, e.g., inside virtual machines. + * To address this issue, we do not use as a reference time interval just + * bfqd->bfq_slice_idle, but bfqd->bfq_slice_idle plus a few jiffies. In + * particular we add the minimum number of jiffies for which the filter + * seems to be quite precise also in embedded systems and KVM/QEMU virtual + * machines. + */ +static unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd, + struct bfq_queue *bfqq) +{ + return max(bfqq->last_idle_bklogged + + HZ * bfqq->service_from_backlogged / + bfqd->bfq_wr_max_softrt_rate, + jiffies + bfqq->bfqd->bfq_slice_idle + 4); +} + +/* + * Return the largest-possible time instant such that, for as long as possible, + * the current time will be lower than this time instant according to the macro + * time_is_before_jiffies(). + */ +static unsigned long bfq_infinity_from_now(unsigned long now) +{ + return now + ULONG_MAX / 2; +} + +/** + * bfq_bfqq_expire - expire a queue. + * @bfqd: device owning the queue. + * @bfqq: the queue to expire. + * @compensate: if true, compensate for the time spent idling. + * @reason: the reason causing the expiration. + * + * + * If the process associated to the queue is slow (i.e., seeky), or in + * case of budget timeout, or, finally, if it is async, we + * artificially charge it an entire budget (independently of the + * actual service it received). As a consequence, the queue will get + * higher timestamps than the correct ones upon reactivation, and + * hence it will be rescheduled as if it had received more service + * than what it actually received. In the end, this class of processes + * will receive less service in proportion to how slowly they consume + * their budgets (and hence how seriously they tend to lower the + * throughput). + * + * In contrast, when a queue expires because it has been idling for + * too much or because it exhausted its budget, we do not touch the + * amount of service it has received. Hence when the queue will be + * reactivated and its timestamps updated, the latter will be in sync + * with the actual service received by the queue until expiration. + * + * Charging a full budget to the first type of queues and the exact + * service to the others has the effect of using the WF2Q+ policy to + * schedule the former on a timeslice basis, without violating the + * service domain guarantees of the latter. + */ +static void bfq_bfqq_expire(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + bool compensate, + enum bfqq_expiration reason) +{ + bool slow; + BUG_ON(bfqq != bfqd->in_service_queue); + + /* + * Update disk peak rate for autotuning and check whether the + * process is slow (see bfq_update_peak_rate). + */ + slow = bfq_update_peak_rate(bfqd, bfqq, compensate, reason); + + /* + * As above explained, 'punish' slow (i.e., seeky), timed-out + * and async queues, to favor sequential sync workloads. + * + * Processes doing I/O in the slower disk zones will tend to be + * slow(er) even if not seeky. Hence, since the estimated peak + * rate is actually an average over the disk surface, these + * processes may timeout just for bad luck. To avoid punishing + * them we do not charge a full budget to a process that + * succeeded in consuming at least 2/3 of its budget. + */ + if (slow || (reason == BFQ_BFQQ_BUDGET_TIMEOUT && + bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3)) + bfq_bfqq_charge_full_budget(bfqq); + + bfqq->service_from_backlogged += bfqq->entity.service; + + if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT && + !bfq_bfqq_constantly_seeky(bfqq)) { + bfq_mark_bfqq_constantly_seeky(bfqq); + if (!blk_queue_nonrot(bfqd->queue)) + bfqd->const_seeky_busy_in_flight_queues++; + } + + if (reason == BFQ_BFQQ_TOO_IDLE && + bfqq->entity.service <= 2 * bfqq->entity.budget / 10 ) + bfq_clear_bfqq_IO_bound(bfqq); + + if (bfqd->low_latency && bfqq->wr_coeff == 1) + bfqq->last_wr_start_finish = jiffies; + + if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 && + RB_EMPTY_ROOT(&bfqq->sort_list)) { + /* + * If we get here, and there are no outstanding requests, + * then the request pattern is isochronous (see the comments + * to the function bfq_bfqq_softrt_next_start()). Hence we + * can compute soft_rt_next_start. If, instead, the queue + * still has outstanding requests, then we have to wait + * for the completion of all the outstanding requests to + * discover whether the request pattern is actually + * isochronous. + */ + if (bfqq->dispatched == 0) + bfqq->soft_rt_next_start = + bfq_bfqq_softrt_next_start(bfqd, bfqq); + else { + /* + * The application is still waiting for the + * completion of one or more requests: + * prevent it from possibly being incorrectly + * deemed as soft real-time by setting its + * soft_rt_next_start to infinity. In fact, + * without this assignment, the application + * would be incorrectly deemed as soft + * real-time if: + * 1) it issued a new request before the + * completion of all its in-flight + * requests, and + * 2) at that time, its soft_rt_next_start + * happened to be in the past. + */ + bfqq->soft_rt_next_start = + bfq_infinity_from_now(jiffies); + /* + * Schedule an update of soft_rt_next_start to when + * the task may be discovered to be isochronous. + */ + bfq_mark_bfqq_softrt_update(bfqq); + } + } + + bfq_log_bfqq(bfqd, bfqq, + "expire (%d, slow %d, num_disp %d, idle_win %d)", reason, + slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq)); + + /* + * Increase, decrease or leave budget unchanged according to + * reason. + */ + __bfq_bfqq_recalc_budget(bfqd, bfqq, reason); + __bfq_bfqq_expire(bfqd, bfqq); +} + +/* + * Budget timeout is not implemented through a dedicated timer, but + * just checked on request arrivals and completions, as well as on + * idle timer expirations. + */ +static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq) +{ + if (bfq_bfqq_budget_new(bfqq) || + time_before(jiffies, bfqq->budget_timeout)) + return false; + return true; +} + +/* + * If we expire a queue that is waiting for the arrival of a new + * request, we may prevent the fictitious timestamp back-shifting that + * allows the guarantees of the queue to be preserved (see [1] for + * this tricky aspect). Hence we return true only if this condition + * does not hold, or if the queue is slow enough to deserve only to be + * kicked off for preserving a high throughput. +*/ +static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq) +{ + bfq_log_bfqq(bfqq->bfqd, bfqq, + "may_budget_timeout: wait_request %d left %d timeout %d", + bfq_bfqq_wait_request(bfqq), + bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3, + bfq_bfqq_budget_timeout(bfqq)); + + return (!bfq_bfqq_wait_request(bfqq) || + bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3) + && + bfq_bfqq_budget_timeout(bfqq); +} + +/* + * For a queue that becomes empty, device idling is allowed only if + * this function returns true for that queue. As a consequence, since + * device idling plays a critical role for both throughput boosting + * and service guarantees, the return value of this function plays a + * critical role as well. + * + * In a nutshell, this function returns true only if idling is + * beneficial for throughput or, even if detrimental for throughput, + * idling is however necessary to preserve service guarantees (low + * latency, desired throughput distribution, ...). In particular, on + * NCQ-capable devices, this function tries to return false, so as to + * help keep the drives' internal queues full, whenever this helps the + * device boost the throughput without causing any service-guarantee + * issue. + * + * In more detail, the return value of this function is obtained by, + * first, computing a number of boolean variables that take into + * account throughput and service-guarantee issues, and, then, + * combining these variables in a logical expression. Most of the + * issues taken into account are not trivial. We discuss these issues + * while introducing the variables. + */ +static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq) +{ + struct bfq_data *bfqd = bfqq->bfqd; + bool idling_boosts_thr, idling_boosts_thr_without_issues, + all_queues_seeky, on_hdd_and_not_all_queues_seeky, + idling_needed_for_service_guarantees, + asymmetric_scenario; + + /* + * The next variable takes into account the cases where idling + * boosts the throughput. + * + * The value of the variable is computed considering, first, that + * idling is virtually always beneficial for the throughput if: + * (a) the device is not NCQ-capable, or + * (b) regardless of the presence of NCQ, the device is rotational + * and the request pattern for bfqq is I/O-bound and sequential. + * + * Secondly, and in contrast to the above item (b), idling an + * NCQ-capable flash-based device would not boost the + * throughput even with sequential I/O; rather it would lower + * the throughput in proportion to how fast the device + * is. Accordingly, the next variable is true if any of the + * above conditions (a) and (b) is true, and, in particular, + * happens to be false if bfqd is an NCQ-capable flash-based + * device. + */ + idling_boosts_thr = !bfqd->hw_tag || + (!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq) && + bfq_bfqq_idle_window(bfqq)) ; + + /* + * The value of the next variable, + * idling_boosts_thr_without_issues, is equal to that of + * idling_boosts_thr, unless a special case holds. In this + * special case, described below, idling may cause problems to + * weight-raised queues. + * + * When the request pool is saturated (e.g., in the presence + * of write hogs), if the processes associated with + * non-weight-raised queues ask for requests at a lower rate, + * then processes associated with weight-raised queues have a + * higher probability to get a request from the pool + * immediately (or at least soon) when they need one. Thus + * they have a higher probability to actually get a fraction + * of the device throughput proportional to their high + * weight. This is especially true with NCQ-capable drives, + * which enqueue several requests in advance, and further + * reorder internally-queued requests. + * + * For this reason, we force to false the value of + * idling_boosts_thr_without_issues if there are weight-raised + * busy queues. In this case, and if bfqq is not weight-raised, + * this guarantees that the device is not idled for bfqq (if, + * instead, bfqq is weight-raised, then idling will be + * guaranteed by another variable, see below). Combined with + * the timestamping rules of BFQ (see [1] for details), this + * behavior causes bfqq, and hence any sync non-weight-raised + * queue, to get a lower number of requests served, and thus + * to ask for a lower number of requests from the request + * pool, before the busy weight-raised queues get served + * again. This often mitigates starvation problems in the + * presence of heavy write workloads and NCQ, thereby + * guaranteeing a higher application and system responsiveness + * in these hostile scenarios. + */ + idling_boosts_thr_without_issues = idling_boosts_thr && + bfqd->wr_busy_queues == 0; + + /* + * There are then two cases where idling must be performed not + * for throughput concerns, but to preserve service + * guarantees. In the description of these cases, we say, for + * short, that a queue is sequential/random if the process + * associated to the queue issues sequential/random requests + * (in the second case the queue may be tagged as seeky or + * even constantly_seeky). + * + * To introduce the first case, we note that, since + * bfq_bfqq_idle_window(bfqq) is false if the device is + * NCQ-capable and bfqq is random (see + * bfq_update_idle_window()), then, from the above two + * assignments it follows that + * idling_boosts_thr_without_issues is false if the device is + * NCQ-capable and bfqq is random. Therefore, for this case, + * device idling would never be allowed if we used just + * idling_boosts_thr_without_issues to decide whether to allow + * it. And, beneficially, this would imply that throughput + * would always be boosted also with random I/O on NCQ-capable + * HDDs. + * + * But we must be careful on this point, to avoid an unfair + * treatment for bfqq. In fact, because of the same above + * assignments, idling_boosts_thr_without_issues is, on the + * other hand, true if 1) the device is an HDD and bfqq is + * sequential, and 2) there are no busy weight-raised + * queues. As a consequence, if we used just + * idling_boosts_thr_without_issues to decide whether to idle + * the device, then with an HDD we might easily bump into a + * scenario where queues that are sequential and I/O-bound + * would enjoy idling, whereas random queues would not. The + * latter might then get a low share of the device throughput, + * simply because the former would get many requests served + * after being set as in service, while the latter would not. + * + * To address this issue, we start by setting to true a + * sentinel variable, on_hdd_and_not_all_queues_seeky, if the + * device is rotational and not all queues with pending or + * in-flight requests are constantly seeky (i.e., there are + * active sequential queues, and bfqq might then be mistreated + * if it does not enjoy idling because it is random). + */ + all_queues_seeky = bfq_bfqq_constantly_seeky(bfqq) && + bfqd->busy_in_flight_queues == + bfqd->const_seeky_busy_in_flight_queues; + + on_hdd_and_not_all_queues_seeky = + !blk_queue_nonrot(bfqd->queue) && !all_queues_seeky; + + /* + * To introduce the second case where idling needs to be + * performed to preserve service guarantees, we can note that + * allowing the drive to enqueue more than one request at a + * time, and hence delegating de facto final scheduling + * decisions to the drive's internal scheduler, causes loss of + * control on the actual request service order. In particular, + * the critical situation is when requests from different + * processes happens to be present, at the same time, in the + * internal queue(s) of the drive. In such a situation, the + * drive, by deciding the service order of the + * internally-queued requests, does determine also the actual + * throughput distribution among these processes. But the + * drive typically has no notion or concern about per-process + * throughput distribution, and makes its decisions only on a + * per-request basis. Therefore, the service distribution + * enforced by the drive's internal scheduler is likely to + * coincide with the desired device-throughput distribution + * only in a completely symmetric scenario where: + * (i) each of these processes must get the same throughput as + * the others; + * (ii) all these processes have the same I/O pattern + (either sequential or random). + * In fact, in such a scenario, the drive will tend to treat + * the requests of each of these processes in about the same + * way as the requests of the others, and thus to provide + * each of these processes with about the same throughput + * (which is exactly the desired throughput distribution). In + * contrast, in any asymmetric scenario, device idling is + * certainly needed to guarantee that bfqq receives its + * assigned fraction of the device throughput (see [1] for + * details). + * + * We address this issue by controlling, actually, only the + * symmetry sub-condition (i), i.e., provided that + * sub-condition (i) holds, idling is not performed, + * regardless of whether sub-condition (ii) holds. In other + * words, only if sub-condition (i) holds, then idling is + * allowed, and the device tends to be prevented from queueing + * many requests, possibly of several processes. The reason + * for not controlling also sub-condition (ii) is that, first, + * in the case of an HDD, the asymmetry in terms of types of + * I/O patterns is already taken in to account in the above + * sentinel variable + * on_hdd_and_not_all_queues_seeky. Secondly, in the case of a + * flash-based device, we prefer however to privilege + * throughput (and idling lowers throughput for this type of + * devices), for the following reasons: + * 1) differently from HDDs, the service time of random + * requests is not orders of magnitudes lower than the service + * time of sequential requests; thus, even if processes doing + * sequential I/O get a preferential treatment with respect to + * others doing random I/O, the consequences are not as + * dramatic as with HDDs; + * 2) if a process doing random I/O does need strong + * throughput guarantees, it is hopefully already being + * weight-raised, or the user is likely to have assigned it a + * higher weight than the other processes (and thus + * sub-condition (i) is likely to be false, which triggers + * idling). + * + * According to the above considerations, the next variable is + * true (only) if sub-condition (i) holds. To compute the + * value of this variable, we not only use the return value of + * the function bfq_symmetric_scenario(), but also check + * whether bfqq is being weight-raised, because + * bfq_symmetric_scenario() does not take into account also + * weight-raised queues (see comments to + * bfq_weights_tree_add()). + * + * As a side note, it is worth considering that the above + * device-idling countermeasures may however fail in the + * following unlucky scenario: if idling is (correctly) + * disabled in a time period during which all symmetry + * sub-conditions hold, and hence the device is allowed to + * enqueue many requests, but at some later point in time some + * sub-condition stops to hold, then it may become impossible + * to let requests be served in the desired order until all + * the requests already queued in the device have been served. + */ + asymmetric_scenario = bfqq->wr_coeff > 1 || + !bfq_symmetric_scenario(bfqd); + + /* + * Finally, there is a case where maximizing throughput is the + * best choice even if it may cause unfairness toward + * bfqq. Such a case is when bfqq became active in a burst of + * queue activations. Queues that became active during a large + * burst benefit only from throughput, as discussed in the + * comments to bfq_handle_burst. Thus, if bfqq became active + * in a burst and not idling the device maximizes throughput, + * then the device must no be idled, because not idling the + * device provides bfqq and all other queues in the burst with + * maximum benefit. Combining this and the two cases above, we + * can now establish when idling is actually needed to + * preserve service guarantees. + */ + idling_needed_for_service_guarantees = + (on_hdd_and_not_all_queues_seeky || asymmetric_scenario) && + !bfq_bfqq_in_large_burst(bfqq); + + /* + * We have now all the components we need to compute the return + * value of the function, which is true only if both the following + * conditions hold: + * 1) bfqq is sync, because idling make sense only for sync queues; + * 2) idling either boosts the throughput (without issues), or + * is necessary to preserve service guarantees. + */ + return bfq_bfqq_sync(bfqq) && + (idling_boosts_thr_without_issues || + idling_needed_for_service_guarantees); +} + +/* + * If the in-service queue is empty but the function bfq_bfqq_may_idle + * returns true, then: + * 1) the queue must remain in service and cannot be expired, and + * 2) the device must be idled to wait for the possible arrival of a new + * request for the queue. + * See the comments to the function bfq_bfqq_may_idle for the reasons + * why performing device idling is the best choice to boost the throughput + * and preserve service guarantees when bfq_bfqq_may_idle itself + * returns true. + */ +static bool bfq_bfqq_must_idle(struct bfq_queue *bfqq) +{ + struct bfq_data *bfqd = bfqq->bfqd; + + return RB_EMPTY_ROOT(&bfqq->sort_list) && bfqd->bfq_slice_idle != 0 && + bfq_bfqq_may_idle(bfqq); +} + +/* + * Select a queue for service. If we have a current queue in service, + * check whether to continue servicing it, or retrieve and set a new one. + */ +static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd) +{ + struct bfq_queue *bfqq; + struct request *next_rq; + enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT; + + bfqq = bfqd->in_service_queue; + if (!bfqq) + goto new_queue; + + bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue"); + + if (bfq_may_expire_for_budg_timeout(bfqq) && + !timer_pending(&bfqd->idle_slice_timer) && + !bfq_bfqq_must_idle(bfqq)) + goto expire; + + next_rq = bfqq->next_rq; + /* + * If bfqq has requests queued and it has enough budget left to + * serve them, keep the queue, otherwise expire it. + */ + if (next_rq) { + if (bfq_serv_to_charge(next_rq, bfqq) > + bfq_bfqq_budget_left(bfqq)) { + reason = BFQ_BFQQ_BUDGET_EXHAUSTED; + goto expire; + } else { + /* + * The idle timer may be pending because we may + * not disable disk idling even when a new request + * arrives. + */ + if (timer_pending(&bfqd->idle_slice_timer)) { + /* + * If we get here: 1) at least a new request + * has arrived but we have not disabled the + * timer because the request was too small, + * 2) then the block layer has unplugged + * the device, causing the dispatch to be + * invoked. + * + * Since the device is unplugged, now the + * requests are probably large enough to + * provide a reasonable throughput. + * So we disable idling. + */ + bfq_clear_bfqq_wait_request(bfqq); + del_timer(&bfqd->idle_slice_timer); +#ifdef CONFIG_BFQ_GROUP_IOSCHED + bfqg_stats_update_idle_time(bfqq_group(bfqq)); +#endif + } + goto keep_queue; + } + } + + /* + * No requests pending. However, if the in-service queue is idling + * for a new request, or has requests waiting for a completion and + * may idle after their completion, then keep it anyway. + */ + if (timer_pending(&bfqd->idle_slice_timer) || + (bfqq->dispatched != 0 && bfq_bfqq_may_idle(bfqq))) { + bfqq = NULL; + goto keep_queue; + } + + reason = BFQ_BFQQ_NO_MORE_REQUESTS; +expire: + bfq_bfqq_expire(bfqd, bfqq, false, reason); +new_queue: + bfqq = bfq_set_in_service_queue(bfqd); + bfq_log(bfqd, "select_queue: new queue %d returned", + bfqq ? bfqq->pid : 0); +keep_queue: + return bfqq; +} + +static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + struct bfq_entity *entity = &bfqq->entity; + if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */ + bfq_log_bfqq(bfqd, bfqq, + "raising period dur %u/%u msec, old coeff %u, w %d(%d)", + jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish), + jiffies_to_msecs(bfqq->wr_cur_max_time), + bfqq->wr_coeff, + bfqq->entity.weight, bfqq->entity.orig_weight); + + BUG_ON(bfqq != bfqd->in_service_queue && entity->weight != + entity->orig_weight * bfqq->wr_coeff); + if (entity->prio_changed) + bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change"); + + /* + * If the queue was activated in a burst, or + * too much time has elapsed from the beginning + * of this weight-raising period, or the queue has + * exceeded the acceptable number of cooperations, + * then end weight raising. + */ + if (bfq_bfqq_in_large_burst(bfqq) || + bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh || + time_is_before_jiffies(bfqq->last_wr_start_finish + + bfqq->wr_cur_max_time)) { + bfqq->last_wr_start_finish = jiffies; + bfq_log_bfqq(bfqd, bfqq, + "wrais ending at %lu, rais_max_time %u", + bfqq->last_wr_start_finish, + jiffies_to_msecs(bfqq->wr_cur_max_time)); + bfq_bfqq_end_wr(bfqq); + } + } + /* Update weight both if it must be raised and if it must be lowered */ + if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1)) + __bfq_entity_update_weight_prio( + bfq_entity_service_tree(entity), + entity); +} + +/* + * Dispatch one request from bfqq, moving it to the request queue + * dispatch list. + */ +static int bfq_dispatch_request(struct bfq_data *bfqd, + struct bfq_queue *bfqq) +{ + int dispatched = 0; + struct request *rq; + unsigned long service_to_charge; + + BUG_ON(RB_EMPTY_ROOT(&bfqq->sort_list)); + + /* Follow expired path, else get first next available. */ + rq = bfq_check_fifo(bfqq); + if (!rq) + rq = bfqq->next_rq; + service_to_charge = bfq_serv_to_charge(rq, bfqq); + + if (service_to_charge > bfq_bfqq_budget_left(bfqq)) { + /* + * This may happen if the next rq is chosen in fifo order + * instead of sector order. The budget is properly + * dimensioned to be always sufficient to serve the next + * request only if it is chosen in sector order. The reason + * is that it would be quite inefficient and little useful + * to always make sure that the budget is large enough to + * serve even the possible next rq in fifo order. + * In fact, requests are seldom served in fifo order. + * + * Expire the queue for budget exhaustion, and make sure + * that the next act_budget is enough to serve the next + * request, even if it comes from the fifo expired path. + */ + bfqq->next_rq = rq; + /* + * Since this dispatch is failed, make sure that + * a new one will be performed + */ + if (!bfqd->rq_in_driver) + bfq_schedule_dispatch(bfqd); + goto expire; + } + + /* Finally, insert request into driver dispatch list. */ + bfq_bfqq_served(bfqq, service_to_charge); + bfq_dispatch_insert(bfqd->queue, rq); + + bfq_update_wr_data(bfqd, bfqq); + + bfq_log_bfqq(bfqd, bfqq, + "dispatched %u sec req (%llu), budg left %d", + blk_rq_sectors(rq), + (long long unsigned)blk_rq_pos(rq), + bfq_bfqq_budget_left(bfqq)); + + dispatched++; + + if (!bfqd->in_service_bic) { + atomic_long_inc(&RQ_BIC(rq)->icq.ioc->refcount); + bfqd->in_service_bic = RQ_BIC(rq); + } + + if (bfqd->busy_queues > 1 && ((!bfq_bfqq_sync(bfqq) && + dispatched >= bfqd->bfq_max_budget_async_rq) || + bfq_class_idle(bfqq))) + goto expire; + + return dispatched; + +expire: + bfq_bfqq_expire(bfqd, bfqq, false, BFQ_BFQQ_BUDGET_EXHAUSTED); + return dispatched; +} + +static int __bfq_forced_dispatch_bfqq(struct bfq_queue *bfqq) +{ + int dispatched = 0; + + while (bfqq->next_rq) { + bfq_dispatch_insert(bfqq->bfqd->queue, bfqq->next_rq); + dispatched++; + } + + BUG_ON(!list_empty(&bfqq->fifo)); + return dispatched; +} + +/* + * Drain our current requests. + * Used for barriers and when switching io schedulers on-the-fly. + */ +static int bfq_forced_dispatch(struct bfq_data *bfqd) +{ + struct bfq_queue *bfqq, *n; + struct bfq_service_tree *st; + int dispatched = 0; + + bfqq = bfqd->in_service_queue; + if (bfqq) + __bfq_bfqq_expire(bfqd, bfqq); + + /* + * Loop through classes, and be careful to leave the scheduler + * in a consistent state, as feedback mechanisms and vtime + * updates cannot be disabled during the process. + */ + list_for_each_entry_safe(bfqq, n, &bfqd->active_list, bfqq_list) { + st = bfq_entity_service_tree(&bfqq->entity); + + dispatched += __bfq_forced_dispatch_bfqq(bfqq); + bfqq->max_budget = bfq_max_budget(bfqd); + + bfq_forget_idle(st); + } + + BUG_ON(bfqd->busy_queues != 0); + + return dispatched; +} + +static int bfq_dispatch_requests(struct request_queue *q, int force) +{ + struct bfq_data *bfqd = q->elevator->elevator_data; + struct bfq_queue *bfqq; + int max_dispatch; + + bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues); + if (bfqd->busy_queues == 0) + return 0; + + if (unlikely(force)) + return bfq_forced_dispatch(bfqd); + + bfqq = bfq_select_queue(bfqd); + if (!bfqq) + return 0; + + if (bfq_class_idle(bfqq)) + max_dispatch = 1; + + if (!bfq_bfqq_sync(bfqq)) + max_dispatch = bfqd->bfq_max_budget_async_rq; + + if (!bfq_bfqq_sync(bfqq) && bfqq->dispatched >= max_dispatch) { + if (bfqd->busy_queues > 1) + return 0; + if (bfqq->dispatched >= 4 * max_dispatch) + return 0; + } + + if (bfqd->sync_flight != 0 && !bfq_bfqq_sync(bfqq)) + return 0; + + bfq_clear_bfqq_wait_request(bfqq); + BUG_ON(timer_pending(&bfqd->idle_slice_timer)); + + if (!bfq_dispatch_request(bfqd, bfqq)) + return 0; + + bfq_log_bfqq(bfqd, bfqq, "dispatched %s request", + bfq_bfqq_sync(bfqq) ? "sync" : "async"); + + return 1; +} + +/* + * Task holds one reference to the queue, dropped when task exits. Each rq + * in-flight on this queue also holds a reference, dropped when rq is freed. + * + * Queue lock must be held here. + */ +static void bfq_put_queue(struct bfq_queue *bfqq) +{ + struct bfq_data *bfqd = bfqq->bfqd; +#ifdef CONFIG_BFQ_GROUP_IOSCHED + struct bfq_group *bfqg = bfqq_group(bfqq); +#endif + + BUG_ON(atomic_read(&bfqq->ref) <= 0); + + bfq_log_bfqq(bfqd, bfqq, "put_queue: %p %d", bfqq, + atomic_read(&bfqq->ref)); + if (!atomic_dec_and_test(&bfqq->ref)) + return; + + BUG_ON(rb_first(&bfqq->sort_list)); + BUG_ON(bfqq->allocated[READ] + bfqq->allocated[WRITE] != 0); + BUG_ON(bfqq->entity.tree); + BUG_ON(bfq_bfqq_busy(bfqq)); + BUG_ON(bfqd->in_service_queue == bfqq); + + if (bfq_bfqq_sync(bfqq)) + /* + * The fact that this queue is being destroyed does not + * invalidate the fact that this queue may have been + * activated during the current burst. As a consequence, + * although the queue does not exist anymore, and hence + * needs to be removed from the burst list if there, + * the burst size has not to be decremented. + */ + hlist_del_init(&bfqq->burst_list_node); + + bfq_log_bfqq(bfqd, bfqq, "put_queue: %p freed", bfqq); + + kmem_cache_free(bfq_pool, bfqq); +#ifdef CONFIG_BFQ_GROUP_IOSCHED + bfqg_put(bfqg); +#endif +} + +static void bfq_put_cooperator(struct bfq_queue *bfqq) +{ + struct bfq_queue *__bfqq, *next; + + /* + * If this queue was scheduled to merge with another queue, be + * sure to drop the reference taken on that queue (and others in + * the merge chain). See bfq_setup_merge and bfq_merge_bfqqs. + */ + __bfqq = bfqq->new_bfqq; + while (__bfqq) { + if (__bfqq == bfqq) + break; + next = __bfqq->new_bfqq; + bfq_put_queue(__bfqq); + __bfqq = next; + } +} + +static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + if (bfqq == bfqd->in_service_queue) { + __bfq_bfqq_expire(bfqd, bfqq); + bfq_schedule_dispatch(bfqd); + } + + bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq, + atomic_read(&bfqq->ref)); + + bfq_put_cooperator(bfqq); + + bfq_put_queue(bfqq); +} + +static void bfq_init_icq(struct io_cq *icq) +{ + struct bfq_io_cq *bic = icq_to_bic(icq); + + bic->ttime.last_end_request = jiffies; + /* + * A newly created bic indicates that the process has just + * started doing I/O, and is probably mapping into memory its + * executable and libraries: it definitely needs weight raising. + * There is however the possibility that the process performs, + * for a while, I/O close to some other process. EQM intercepts + * this behavior and may merge the queue corresponding to the + * process with some other queue, BEFORE the weight of the queue + * is raised. Merged queues are not weight-raised (they are assumed + * to belong to processes that benefit only from high throughput). + * If the merge is basically the consequence of an accident, then + * the queue will be split soon and will get back its old weight. + * It is then important to write down somewhere that this queue + * does need weight raising, even if it did not make it to get its + * weight raised before being merged. To this purpose, we overload + * the field raising_time_left and assign 1 to it, to mark the queue + * as needing weight raising. + */ + bic->wr_time_left = 1; +} + +static void bfq_exit_icq(struct io_cq *icq) +{ + struct bfq_io_cq *bic = icq_to_bic(icq); + struct bfq_data *bfqd = bic_to_bfqd(bic); + + if (bic->bfqq[BLK_RW_ASYNC]) { + bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_ASYNC]); + bic->bfqq[BLK_RW_ASYNC] = NULL; + } + + if (bic->bfqq[BLK_RW_SYNC]) { + /* + * If the bic is using a shared queue, put the reference + * taken on the io_context when the bic started using a + * shared bfq_queue. + */ + if (bfq_bfqq_coop(bic->bfqq[BLK_RW_SYNC])) + put_io_context(icq->ioc); + bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]); + bic->bfqq[BLK_RW_SYNC] = NULL; + } +} + +/* + * Update the entity prio values; note that the new values will not + * be used until the next (re)activation. + */ +static void bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic) +{ + struct task_struct *tsk = current; + int ioprio_class; + + ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio); + switch (ioprio_class) { + default: + dev_err(bfqq->bfqd->queue->backing_dev_info.dev, + "bfq: bad prio class %d\n", ioprio_class); + case IOPRIO_CLASS_NONE: + /* + * No prio set, inherit CPU scheduling settings. + */ + bfqq->new_ioprio = task_nice_ioprio(tsk); + bfqq->new_ioprio_class = task_nice_ioclass(tsk); + break; + case IOPRIO_CLASS_RT: + bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio); + bfqq->new_ioprio_class = IOPRIO_CLASS_RT; + break; + case IOPRIO_CLASS_BE: + bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio); + bfqq->new_ioprio_class = IOPRIO_CLASS_BE; + break; + case IOPRIO_CLASS_IDLE: + bfqq->new_ioprio_class = IOPRIO_CLASS_IDLE; + bfqq->new_ioprio = 7; + bfq_clear_bfqq_idle_window(bfqq); + break; + } + + if (bfqq->new_ioprio < 0 || bfqq->new_ioprio >= IOPRIO_BE_NR) { + printk(KERN_CRIT "bfq_set_next_ioprio_data: new_ioprio %d\n", + bfqq->new_ioprio); + BUG(); + } + + bfqq->entity.new_weight = bfq_ioprio_to_weight(bfqq->new_ioprio); + bfqq->entity.prio_changed = 1; +} + +static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio) +{ + struct bfq_data *bfqd; + struct bfq_queue *bfqq, *new_bfqq; + unsigned long uninitialized_var(flags); + int ioprio = bic->icq.ioc->ioprio; + + bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data), + &flags); + /* + * This condition may trigger on a newly created bic, be sure to + * drop the lock before returning. + */ + if (unlikely(!bfqd) || likely(bic->ioprio == ioprio)) + goto out; + + bic->ioprio = ioprio; + + bfqq = bic->bfqq[BLK_RW_ASYNC]; + if (bfqq) { + new_bfqq = bfq_get_queue(bfqd, bio, BLK_RW_ASYNC, bic, + GFP_ATOMIC); + if (new_bfqq) { + bic->bfqq[BLK_RW_ASYNC] = new_bfqq; + bfq_log_bfqq(bfqd, bfqq, + "check_ioprio_change: bfqq %p %d", + bfqq, atomic_read(&bfqq->ref)); + bfq_put_queue(bfqq); + } + } + + bfqq = bic->bfqq[BLK_RW_SYNC]; + if (bfqq) + bfq_set_next_ioprio_data(bfqq, bic); + +out: + bfq_put_bfqd_unlock(bfqd, &flags); +} + +static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq, + struct bfq_io_cq *bic, pid_t pid, int is_sync) +{ + RB_CLEAR_NODE(&bfqq->entity.rb_node); + INIT_LIST_HEAD(&bfqq->fifo); + INIT_HLIST_NODE(&bfqq->burst_list_node); + + atomic_set(&bfqq->ref, 0); + bfqq->bfqd = bfqd; + + if (bic) + bfq_set_next_ioprio_data(bfqq, bic); + + if (is_sync) { + if (!bfq_class_idle(bfqq)) + bfq_mark_bfqq_idle_window(bfqq); + bfq_mark_bfqq_sync(bfqq); + } else + bfq_clear_bfqq_sync(bfqq); + bfq_mark_bfqq_IO_bound(bfqq); + + /* Tentative initial value to trade off between thr and lat */ + bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3; + bfqq->pid = pid; + + bfqq->wr_coeff = 1; + bfqq->last_wr_start_finish = 0; + /* + * Set to the value for which bfqq will not be deemed as + * soft rt when it becomes backlogged. + */ + bfqq->soft_rt_next_start = bfq_infinity_from_now(jiffies); +} + +static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd, + struct bio *bio, int is_sync, + struct bfq_io_cq *bic, + gfp_t gfp_mask) +{ + struct bfq_group *bfqg; + struct bfq_queue *bfqq, *new_bfqq = NULL; + struct blkcg *blkcg; + +retry: + rcu_read_lock(); + + blkcg = bio_blkcg(bio); + bfqg = bfq_find_alloc_group(bfqd, blkcg); + /* bic always exists here */ + bfqq = bic_to_bfqq(bic, is_sync); + + /* + * Always try a new alloc if we fall back to the OOM bfqq + * originally, since it should just be a temporary situation. + */ + if (!bfqq || bfqq == &bfqd->oom_bfqq) { + bfqq = NULL; + if (new_bfqq) { + bfqq = new_bfqq; + new_bfqq = NULL; + } else if (gfpflags_allow_blocking(gfp_mask)) { + rcu_read_unlock(); + spin_unlock_irq(bfqd->queue->queue_lock); + new_bfqq = kmem_cache_alloc_node(bfq_pool, + gfp_mask | __GFP_ZERO, + bfqd->queue->node); + spin_lock_irq(bfqd->queue->queue_lock); + if (new_bfqq) + goto retry; + } else { + bfqq = kmem_cache_alloc_node(bfq_pool, + gfp_mask | __GFP_ZERO, + bfqd->queue->node); + } + + if (bfqq) { + bfq_init_bfqq(bfqd, bfqq, bic, current->pid, + is_sync); + bfq_init_entity(&bfqq->entity, bfqg); + bfq_log_bfqq(bfqd, bfqq, "allocated"); + } else { + bfqq = &bfqd->oom_bfqq; + bfq_log_bfqq(bfqd, bfqq, "using oom bfqq"); + } + } + + if (new_bfqq) + kmem_cache_free(bfq_pool, new_bfqq); + + rcu_read_unlock(); + + return bfqq; +} + +static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd, + struct bfq_group *bfqg, + int ioprio_class, int ioprio) +{ + switch (ioprio_class) { + case IOPRIO_CLASS_RT: + return &bfqg->async_bfqq[0][ioprio]; + case IOPRIO_CLASS_NONE: + ioprio = IOPRIO_NORM; + /* fall through */ + case IOPRIO_CLASS_BE: + return &bfqg->async_bfqq[1][ioprio]; + case IOPRIO_CLASS_IDLE: + return &bfqg->async_idle_bfqq; + default: + BUG(); + } +} + +static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, + struct bio *bio, int is_sync, + struct bfq_io_cq *bic, gfp_t gfp_mask) +{ + const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio); + const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio); + struct bfq_queue **async_bfqq = NULL; + struct bfq_queue *bfqq = NULL; + + if (!is_sync) { + struct blkcg *blkcg; + struct bfq_group *bfqg; + + rcu_read_lock(); + blkcg = bio_blkcg(bio); + rcu_read_unlock(); + bfqg = bfq_find_alloc_group(bfqd, blkcg); + async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class, + ioprio); + bfqq = *async_bfqq; + } + + if (!bfqq) + bfqq = bfq_find_alloc_queue(bfqd, bio, is_sync, bic, gfp_mask); + + /* + * Pin the queue now that it's allocated, scheduler exit will + * prune it. + */ + if (!is_sync && !(*async_bfqq)) { + atomic_inc(&bfqq->ref); + bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d", + bfqq, atomic_read(&bfqq->ref)); + *async_bfqq = bfqq; + } + + atomic_inc(&bfqq->ref); + bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq, + atomic_read(&bfqq->ref)); + return bfqq; +} + +static void bfq_update_io_thinktime(struct bfq_data *bfqd, + struct bfq_io_cq *bic) +{ + unsigned long elapsed = jiffies - bic->ttime.last_end_request; + unsigned long ttime = min(elapsed, 2UL * bfqd->bfq_slice_idle); + + bic->ttime.ttime_samples = (7*bic->ttime.ttime_samples + 256) / 8; + bic->ttime.ttime_total = (7*bic->ttime.ttime_total + 256*ttime) / 8; + bic->ttime.ttime_mean = (bic->ttime.ttime_total + 128) / + bic->ttime.ttime_samples; +} + +static void bfq_update_io_seektime(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + struct request *rq) +{ + sector_t sdist; + u64 total; + + if (bfqq->last_request_pos < blk_rq_pos(rq)) + sdist = blk_rq_pos(rq) - bfqq->last_request_pos; + else + sdist = bfqq->last_request_pos - blk_rq_pos(rq); + + /* + * Don't allow the seek distance to get too large from the + * odd fragment, pagein, etc. + */ + if (bfqq->seek_samples == 0) /* first request, not really a seek */ + sdist = 0; + else if (bfqq->seek_samples <= 60) /* second & third seek */ + sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*1024); + else + sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*64); + + bfqq->seek_samples = (7*bfqq->seek_samples + 256) / 8; + bfqq->seek_total = (7*bfqq->seek_total + (u64)256*sdist) / 8; + total = bfqq->seek_total + (bfqq->seek_samples/2); + do_div(total, bfqq->seek_samples); + bfqq->seek_mean = (sector_t)total; + + bfq_log_bfqq(bfqd, bfqq, "dist=%llu mean=%llu", (u64)sdist, + (u64)bfqq->seek_mean); +} + +/* + * Disable idle window if the process thinks too long or seeks so much that + * it doesn't matter. + */ +static void bfq_update_idle_window(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + struct bfq_io_cq *bic) +{ + int enable_idle; + + /* Don't idle for async or idle io prio class. */ + if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq)) + return; + + /* Idle window just restored, statistics are meaningless. */ + if (bfq_bfqq_just_split(bfqq)) + return; + + enable_idle = bfq_bfqq_idle_window(bfqq); + + if (atomic_read(&bic->icq.ioc->active_ref) == 0 || + bfqd->bfq_slice_idle == 0 || + (bfqd->hw_tag && BFQQ_SEEKY(bfqq) && + bfqq->wr_coeff == 1)) + enable_idle = 0; + else if (bfq_sample_valid(bic->ttime.ttime_samples)) { + if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle && + bfqq->wr_coeff == 1) + enable_idle = 0; + else + enable_idle = 1; + } + bfq_log_bfqq(bfqd, bfqq, "update_idle_window: enable_idle %d", + enable_idle); + + if (enable_idle) + bfq_mark_bfqq_idle_window(bfqq); + else + bfq_clear_bfqq_idle_window(bfqq); +} + +/* + * Called when a new fs request (rq) is added to bfqq. Check if there's + * something we should do about it. + */ +static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq, + struct request *rq) +{ + struct bfq_io_cq *bic = RQ_BIC(rq); + + if (rq->cmd_flags & REQ_META) + bfqq->meta_pending++; + + bfq_update_io_thinktime(bfqd, bic); + bfq_update_io_seektime(bfqd, bfqq, rq); + if (!BFQQ_SEEKY(bfqq) && bfq_bfqq_constantly_seeky(bfqq)) { + bfq_clear_bfqq_constantly_seeky(bfqq); + if (!blk_queue_nonrot(bfqd->queue)) { + BUG_ON(!bfqd->const_seeky_busy_in_flight_queues); + bfqd->const_seeky_busy_in_flight_queues--; + } + } + if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 || + !BFQQ_SEEKY(bfqq)) + bfq_update_idle_window(bfqd, bfqq, bic); + bfq_clear_bfqq_just_split(bfqq); + + bfq_log_bfqq(bfqd, bfqq, + "rq_enqueued: idle_window=%d (seeky %d, mean %llu)", + bfq_bfqq_idle_window(bfqq), BFQQ_SEEKY(bfqq), + (long long unsigned)bfqq->seek_mean); + + bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq); + + if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) { + bool small_req = bfqq->queued[rq_is_sync(rq)] == 1 && + blk_rq_sectors(rq) < 32; + bool budget_timeout = bfq_bfqq_budget_timeout(bfqq); + + /* + * There is just this request queued: if the request + * is small and the queue is not to be expired, then + * just exit. + * + * In this way, if the disk is being idled to wait for + * a new request from the in-service queue, we avoid + * unplugging the device and committing the disk to serve + * just a small request. On the contrary, we wait for + * the block layer to decide when to unplug the device: + * hopefully, new requests will be merged to this one + * quickly, then the device will be unplugged and + * larger requests will be dispatched. + */ + if (small_req && !budget_timeout) + return; + + /* + * A large enough request arrived, or the queue is to + * be expired: in both cases disk idling is to be + * stopped, so clear wait_request flag and reset + * timer. + */ + bfq_clear_bfqq_wait_request(bfqq); + del_timer(&bfqd->idle_slice_timer); +#ifdef CONFIG_BFQ_GROUP_IOSCHED + bfqg_stats_update_idle_time(bfqq_group(bfqq)); +#endif + + /* + * The queue is not empty, because a new request just + * arrived. Hence we can safely expire the queue, in + * case of budget timeout, without risking that the + * timestamps of the queue are not updated correctly. + * See [1] for more details. + */ + if (budget_timeout) + bfq_bfqq_expire(bfqd, bfqq, false, + BFQ_BFQQ_BUDGET_TIMEOUT); + + /* + * Let the request rip immediately, or let a new queue be + * selected if bfqq has just been expired. + */ + __blk_run_queue(bfqd->queue); + } +} + +static void bfq_insert_request(struct request_queue *q, struct request *rq) +{ + struct bfq_data *bfqd = q->elevator->elevator_data; + struct bfq_queue *bfqq = RQ_BFQQ(rq), *new_bfqq; + + assert_spin_locked(bfqd->queue->queue_lock); + + /* + * An unplug may trigger a requeue of a request from the device + * driver: make sure we are in process context while trying to + * merge two bfq_queues. + */ + if (!in_interrupt()) { + new_bfqq = bfq_setup_cooperator(bfqd, bfqq, rq, true); + if (new_bfqq) { + if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq) + new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1); + /* + * Release the request's reference to the old bfqq + * and make sure one is taken to the shared queue. + */ + new_bfqq->allocated[rq_data_dir(rq)]++; + bfqq->allocated[rq_data_dir(rq)]--; + atomic_inc(&new_bfqq->ref); + bfq_put_queue(bfqq); + if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq) + bfq_merge_bfqqs(bfqd, RQ_BIC(rq), + bfqq, new_bfqq); + rq->elv.priv[1] = new_bfqq; + bfqq = new_bfqq; + } else + bfq_bfqq_increase_failed_cooperations(bfqq); + } + + bfq_add_request(rq); + + /* + * Here a newly-created bfq_queue has already started a weight-raising + * period: clear raising_time_left to prevent bfq_bfqq_save_state() + * from assigning it a full weight-raising period. See the detailed + * comments about this field in bfq_init_icq(). + */ + if (bfqq->bic) + bfqq->bic->wr_time_left = 0; + rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)]; + list_add_tail(&rq->queuelist, &bfqq->fifo); + + bfq_rq_enqueued(bfqd, bfqq, rq); +} + +static void bfq_update_hw_tag(struct bfq_data *bfqd) +{ + bfqd->max_rq_in_driver = max(bfqd->max_rq_in_driver, + bfqd->rq_in_driver); + + if (bfqd->hw_tag == 1) + return; + + /* + * This sample is valid if the number of outstanding requests + * is large enough to allow a queueing behavior. Note that the + * sum is not exact, as it's not taking into account deactivated + * requests. + */ + if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD) + return; + + if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES) + return; + + bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD; + bfqd->max_rq_in_driver = 0; + bfqd->hw_tag_samples = 0; +} + +static void bfq_completed_request(struct request_queue *q, struct request *rq) +{ + struct bfq_queue *bfqq = RQ_BFQQ(rq); + struct bfq_data *bfqd = bfqq->bfqd; + bool sync = bfq_bfqq_sync(bfqq); + + bfq_log_bfqq(bfqd, bfqq, "completed one req with %u sects left (%d)", + blk_rq_sectors(rq), sync); + + bfq_update_hw_tag(bfqd); + + BUG_ON(!bfqd->rq_in_driver); + BUG_ON(!bfqq->dispatched); + bfqd->rq_in_driver--; + bfqq->dispatched--; +#ifdef CONFIG_BFQ_GROUP_IOSCHED + bfqg_stats_update_completion(bfqq_group(bfqq), + rq_start_time_ns(rq), + rq_io_start_time_ns(rq), rq->cmd_flags); +#endif + + if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) { + bfq_weights_tree_remove(bfqd, &bfqq->entity, + &bfqd->queue_weights_tree); + if (!blk_queue_nonrot(bfqd->queue)) { + BUG_ON(!bfqd->busy_in_flight_queues); + bfqd->busy_in_flight_queues--; + if (bfq_bfqq_constantly_seeky(bfqq)) { + BUG_ON(!bfqd-> + const_seeky_busy_in_flight_queues); + bfqd->const_seeky_busy_in_flight_queues--; + } + } + } + + if (sync) { + bfqd->sync_flight--; + RQ_BIC(rq)->ttime.last_end_request = jiffies; + } + + /* + * If we are waiting to discover whether the request pattern of the + * task associated with the queue is actually isochronous, and + * both requisites for this condition to hold are satisfied, then + * compute soft_rt_next_start (see the comments to the function + * bfq_bfqq_softrt_next_start()). + */ + if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 && + RB_EMPTY_ROOT(&bfqq->sort_list)) + bfqq->soft_rt_next_start = + bfq_bfqq_softrt_next_start(bfqd, bfqq); + + /* + * If this is the in-service queue, check if it needs to be expired, + * or if we want to idle in case it has no pending requests. + */ + if (bfqd->in_service_queue == bfqq) { + if (bfq_bfqq_budget_new(bfqq)) + bfq_set_budget_timeout(bfqd); + + if (bfq_bfqq_must_idle(bfqq)) { + bfq_arm_slice_timer(bfqd); + goto out; + } else if (bfq_may_expire_for_budg_timeout(bfqq)) + bfq_bfqq_expire(bfqd, bfqq, false, + BFQ_BFQQ_BUDGET_TIMEOUT); + else if (RB_EMPTY_ROOT(&bfqq->sort_list) && + (bfqq->dispatched == 0 || + !bfq_bfqq_may_idle(bfqq))) + bfq_bfqq_expire(bfqd, bfqq, false, + BFQ_BFQQ_NO_MORE_REQUESTS); + } + + if (!bfqd->rq_in_driver) + bfq_schedule_dispatch(bfqd); + +out: + return; +} + +static int __bfq_may_queue(struct bfq_queue *bfqq) +{ + if (bfq_bfqq_wait_request(bfqq) && bfq_bfqq_must_alloc(bfqq)) { + bfq_clear_bfqq_must_alloc(bfqq); + return ELV_MQUEUE_MUST; + } + + return ELV_MQUEUE_MAY; +} + +static int bfq_may_queue(struct request_queue *q, int rw) +{ + struct bfq_data *bfqd = q->elevator->elevator_data; + struct task_struct *tsk = current; + struct bfq_io_cq *bic; + struct bfq_queue *bfqq; + + /* + * Don't force setup of a queue from here, as a call to may_queue + * does not necessarily imply that a request actually will be + * queued. So just lookup a possibly existing queue, or return + * 'may queue' if that fails. + */ + bic = bfq_bic_lookup(bfqd, tsk->io_context); + if (!bic) + return ELV_MQUEUE_MAY; + + bfqq = bic_to_bfqq(bic, rw_is_sync(rw)); + if (bfqq) + return __bfq_may_queue(bfqq); + + return ELV_MQUEUE_MAY; +} + +/* + * Queue lock held here. + */ +static void bfq_put_request(struct request *rq) +{ + struct bfq_queue *bfqq = RQ_BFQQ(rq); + + if (bfqq) { + const int rw = rq_data_dir(rq); + + BUG_ON(!bfqq->allocated[rw]); + bfqq->allocated[rw]--; + + rq->elv.priv[0] = NULL; + rq->elv.priv[1] = NULL; + + bfq_log_bfqq(bfqq->bfqd, bfqq, "put_request %p, %d", + bfqq, atomic_read(&bfqq->ref)); + bfq_put_queue(bfqq); + } +} + +/* + * Returns NULL if a new bfqq should be allocated, or the old bfqq if this + * was the last process referring to said bfqq. + */ +static struct bfq_queue * +bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq) +{ + bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue"); + + put_io_context(bic->icq.ioc); + + if (bfqq_process_refs(bfqq) == 1) { + bfqq->pid = current->pid; + bfq_clear_bfqq_coop(bfqq); + bfq_clear_bfqq_split_coop(bfqq); + return bfqq; + } + + bic_set_bfqq(bic, NULL, 1); + + bfq_put_cooperator(bfqq); + + bfq_put_queue(bfqq); + return NULL; +} + +/* + * Allocate bfq data structures associated with this request. + */ +static int bfq_set_request(struct request_queue *q, struct request *rq, + struct bio *bio, gfp_t gfp_mask) +{ + struct bfq_data *bfqd = q->elevator->elevator_data; + struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq); + const int rw = rq_data_dir(rq); + const int is_sync = rq_is_sync(rq); + struct bfq_queue *bfqq; + unsigned long flags; + bool split = false; + + might_sleep_if(gfpflags_allow_blocking(gfp_mask)); + + bfq_check_ioprio_change(bic, bio); + + spin_lock_irqsave(q->queue_lock, flags); + + if (!bic) + goto queue_fail; + + bfq_bic_update_cgroup(bic, bio); + +new_queue: + bfqq = bic_to_bfqq(bic, is_sync); + if (!bfqq || bfqq == &bfqd->oom_bfqq) { + bfqq = bfq_get_queue(bfqd, bio, is_sync, bic, gfp_mask); + bic_set_bfqq(bic, bfqq, is_sync); + if (split && is_sync) { + if ((bic->was_in_burst_list && bfqd->large_burst) || + bic->saved_in_large_burst) + bfq_mark_bfqq_in_large_burst(bfqq); + else { + bfq_clear_bfqq_in_large_burst(bfqq); + if (bic->was_in_burst_list) + hlist_add_head(&bfqq->burst_list_node, + &bfqd->burst_list); + } + } + } else { + /* If the queue was seeky for too long, break it apart. */ + if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) { + bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq"); + bfqq = bfq_split_bfqq(bic, bfqq); + split = true; + if (!bfqq) + goto new_queue; + } + } + + bfqq->allocated[rw]++; + atomic_inc(&bfqq->ref); + bfq_log_bfqq(bfqd, bfqq, "set_request: bfqq %p, %d", bfqq, + atomic_read(&bfqq->ref)); + + rq->elv.priv[0] = bic; + rq->elv.priv[1] = bfqq; + + /* + * If a bfq_queue has only one process reference, it is owned + * by only one bfq_io_cq: we can set the bic field of the + * bfq_queue to the address of that structure. Also, if the + * queue has just been split, mark a flag so that the + * information is available to the other scheduler hooks. + */ + if (likely(bfqq != &bfqd->oom_bfqq) && bfqq_process_refs(bfqq) == 1) { + bfqq->bic = bic; + if (split) { + bfq_mark_bfqq_just_split(bfqq); + /* + * If the queue has just been split from a shared + * queue, restore the idle window and the possible + * weight raising period. + */ + bfq_bfqq_resume_state(bfqq, bic); + } + } + + spin_unlock_irqrestore(q->queue_lock, flags); + + return 0; + +queue_fail: + bfq_schedule_dispatch(bfqd); + spin_unlock_irqrestore(q->queue_lock, flags); + + return 1; +} + +static void bfq_kick_queue(struct work_struct *work) +{ + struct bfq_data *bfqd = + container_of(work, struct bfq_data, unplug_work); + struct request_queue *q = bfqd->queue; + + spin_lock_irq(q->queue_lock); + __blk_run_queue(q); + spin_unlock_irq(q->queue_lock); +} + +/* + * Handler of the expiration of the timer running if the in-service queue + * is idling inside its time slice. + */ +static void bfq_idle_slice_timer(unsigned long data) +{ + struct bfq_data *bfqd = (struct bfq_data *)data; + struct bfq_queue *bfqq; + unsigned long flags; + enum bfqq_expiration reason; + + spin_lock_irqsave(bfqd->queue->queue_lock, flags); + + bfqq = bfqd->in_service_queue; + /* + * Theoretical race here: the in-service queue can be NULL or + * different from the queue that was idling if the timer handler + * spins on the queue_lock and a new request arrives for the + * current queue and there is a full dispatch cycle that changes + * the in-service queue. This can hardly happen, but in the worst + * case we just expire a queue too early. + */ + if (bfqq) { + bfq_log_bfqq(bfqd, bfqq, "slice_timer expired"); + if (bfq_bfqq_budget_timeout(bfqq)) + /* + * Also here the queue can be safely expired + * for budget timeout without wasting + * guarantees + */ + reason = BFQ_BFQQ_BUDGET_TIMEOUT; + else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0) + /* + * The queue may not be empty upon timer expiration, + * because we may not disable the timer when the + * first request of the in-service queue arrives + * during disk idling. + */ + reason = BFQ_BFQQ_TOO_IDLE; + else + goto schedule_dispatch; + + bfq_bfqq_expire(bfqd, bfqq, true, reason); + } + +schedule_dispatch: + bfq_schedule_dispatch(bfqd); + + spin_unlock_irqrestore(bfqd->queue->queue_lock, flags); +} + +static void bfq_shutdown_timer_wq(struct bfq_data *bfqd) +{ + del_timer_sync(&bfqd->idle_slice_timer); + cancel_work_sync(&bfqd->unplug_work); +} + +static void __bfq_put_async_bfqq(struct bfq_data *bfqd, + struct bfq_queue **bfqq_ptr) +{ + struct bfq_group *root_group = bfqd->root_group; + struct bfq_queue *bfqq = *bfqq_ptr; + + bfq_log(bfqd, "put_async_bfqq: %p", bfqq); + if (bfqq) { + bfq_bfqq_move(bfqd, bfqq, &bfqq->entity, root_group); + bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d", + bfqq, atomic_read(&bfqq->ref)); + bfq_put_queue(bfqq); + *bfqq_ptr = NULL; + } +} + +/* + * Release all the bfqg references to its async queues. If we are + * deallocating the group these queues may still contain requests, so + * we reparent them to the root cgroup (i.e., the only one that will + * exist for sure until all the requests on a device are gone). + */ +static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg) +{ + int i, j; + + for (i = 0; i < 2; i++) + for (j = 0; j < IOPRIO_BE_NR; j++) + __bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j]); + + __bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq); +} + +static void bfq_exit_queue(struct elevator_queue *e) +{ + struct bfq_data *bfqd = e->elevator_data; + struct request_queue *q = bfqd->queue; + struct bfq_queue *bfqq, *n; + + bfq_shutdown_timer_wq(bfqd); + + spin_lock_irq(q->queue_lock); + + BUG_ON(bfqd->in_service_queue); + list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list) + bfq_deactivate_bfqq(bfqd, bfqq, 0); + + spin_unlock_irq(q->queue_lock); + + bfq_shutdown_timer_wq(bfqd); + + synchronize_rcu(); + + BUG_ON(timer_pending(&bfqd->idle_slice_timer)); + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + blkcg_deactivate_policy(q, &blkcg_policy_bfq); +#else + kfree(bfqd->root_group); +#endif + + kfree(bfqd); +} + +static void bfq_init_root_group(struct bfq_group *root_group, + struct bfq_data *bfqd) +{ + int i; + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + root_group->entity.parent = NULL; + root_group->my_entity = NULL; + root_group->bfqd = bfqd; +#endif + root_group->rq_pos_tree = RB_ROOT; + for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) + root_group->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT; +} + +static int bfq_init_queue(struct request_queue *q, struct elevator_type *e) +{ + struct bfq_data *bfqd; + struct elevator_queue *eq; + + eq = elevator_alloc(q, e); + if (!eq) + return -ENOMEM; + + bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node); + if (!bfqd) { + kobject_put(&eq->kobj); + return -ENOMEM; + } + eq->elevator_data = bfqd; + + /* + * Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues. + * Grab a permanent reference to it, so that the normal code flow + * will not attempt to free it. + */ + bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, NULL, 1, 0); + atomic_inc(&bfqd->oom_bfqq.ref); + bfqd->oom_bfqq.new_ioprio = BFQ_DEFAULT_QUEUE_IOPRIO; + bfqd->oom_bfqq.new_ioprio_class = IOPRIO_CLASS_BE; + bfqd->oom_bfqq.entity.new_weight = + bfq_ioprio_to_weight(bfqd->oom_bfqq.new_ioprio); + /* + * Trigger weight initialization, according to ioprio, at the + * oom_bfqq's first activation. The oom_bfqq's ioprio and ioprio + * class won't be changed any more. + */ + bfqd->oom_bfqq.entity.prio_changed = 1; + + bfqd->queue = q; + + spin_lock_irq(q->queue_lock); + q->elevator = eq; + spin_unlock_irq(q->queue_lock); + + bfqd->root_group = bfq_create_group_hierarchy(bfqd, q->node); + if (!bfqd->root_group) + goto out_free; + bfq_init_root_group(bfqd->root_group, bfqd); + bfq_init_entity(&bfqd->oom_bfqq.entity, bfqd->root_group); +#ifdef CONFIG_BFQ_GROUP_IOSCHED + bfqd->active_numerous_groups = 0; +#endif + + init_timer(&bfqd->idle_slice_timer); + bfqd->idle_slice_timer.function = bfq_idle_slice_timer; + bfqd->idle_slice_timer.data = (unsigned long)bfqd; + + bfqd->queue_weights_tree = RB_ROOT; + bfqd->group_weights_tree = RB_ROOT; + + INIT_WORK(&bfqd->unplug_work, bfq_kick_queue); + + INIT_LIST_HEAD(&bfqd->active_list); + INIT_LIST_HEAD(&bfqd->idle_list); + INIT_HLIST_HEAD(&bfqd->burst_list); + + bfqd->hw_tag = -1; + + bfqd->bfq_max_budget = bfq_default_max_budget; + + bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0]; + bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1]; + bfqd->bfq_back_max = bfq_back_max; + bfqd->bfq_back_penalty = bfq_back_penalty; + bfqd->bfq_slice_idle = bfq_slice_idle; + bfqd->bfq_class_idle_last_service = 0; + bfqd->bfq_max_budget_async_rq = bfq_max_budget_async_rq; + bfqd->bfq_timeout[BLK_RW_ASYNC] = bfq_timeout_async; + bfqd->bfq_timeout[BLK_RW_SYNC] = bfq_timeout_sync; + + bfqd->bfq_coop_thresh = 2; + bfqd->bfq_failed_cooperations = 7000; + bfqd->bfq_requests_within_timer = 120; + + bfqd->bfq_large_burst_thresh = 11; + bfqd->bfq_burst_interval = msecs_to_jiffies(500); + + bfqd->low_latency = true; + + bfqd->bfq_wr_coeff = 20; + bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300); + bfqd->bfq_wr_max_time = 0; + bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000); + bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500); + bfqd->bfq_wr_max_softrt_rate = 7000; /* + * Approximate rate required + * to playback or record a + * high-definition compressed + * video. + */ + bfqd->wr_busy_queues = 0; + bfqd->busy_in_flight_queues = 0; + bfqd->const_seeky_busy_in_flight_queues = 0; + + /* + * Begin by assuming, optimistically, that the device peak rate is + * equal to the highest reference rate. + */ + bfqd->RT_prod = R_fast[blk_queue_nonrot(bfqd->queue)] * + T_fast[blk_queue_nonrot(bfqd->queue)]; + bfqd->peak_rate = R_fast[blk_queue_nonrot(bfqd->queue)]; + bfqd->device_speed = BFQ_BFQD_FAST; + + return 0; + +out_free: + kfree(bfqd); + kobject_put(&eq->kobj); + return -ENOMEM; +} + +static void bfq_slab_kill(void) +{ + if (bfq_pool) + kmem_cache_destroy(bfq_pool); +} + +static int __init bfq_slab_setup(void) +{ + bfq_pool = KMEM_CACHE(bfq_queue, 0); + if (!bfq_pool) + return -ENOMEM; + return 0; +} + +static ssize_t bfq_var_show(unsigned int var, char *page) +{ + return sprintf(page, "%d\n", var); +} + +static ssize_t bfq_var_store(unsigned long *var, const char *page, + size_t count) +{ + unsigned long new_val; + int ret = kstrtoul(page, 10, &new_val); + + if (ret == 0) + *var = new_val; + + return count; +} + +static ssize_t bfq_wr_max_time_show(struct elevator_queue *e, char *page) +{ + struct bfq_data *bfqd = e->elevator_data; + return sprintf(page, "%d\n", bfqd->bfq_wr_max_time > 0 ? + jiffies_to_msecs(bfqd->bfq_wr_max_time) : + jiffies_to_msecs(bfq_wr_duration(bfqd))); +} + +static ssize_t bfq_weights_show(struct elevator_queue *e, char *page) +{ + struct bfq_queue *bfqq; + struct bfq_data *bfqd = e->elevator_data; + ssize_t num_char = 0; + + num_char += sprintf(page + num_char, "Tot reqs queued %d\n\n", + bfqd->queued); + + spin_lock_irq(bfqd->queue->queue_lock); + + num_char += sprintf(page + num_char, "Active:\n"); + list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) { + num_char += sprintf(page + num_char, + "pid%d: weight %hu, nr_queued %d %d, dur %d/%u\n", + bfqq->pid, + bfqq->entity.weight, + bfqq->queued[0], + bfqq->queued[1], + jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish), + jiffies_to_msecs(bfqq->wr_cur_max_time)); + } + + num_char += sprintf(page + num_char, "Idle:\n"); + list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) { + num_char += sprintf(page + num_char, + "pid%d: weight %hu, dur %d/%u\n", + bfqq->pid, + bfqq->entity.weight, + jiffies_to_msecs(jiffies - + bfqq->last_wr_start_finish), + jiffies_to_msecs(bfqq->wr_cur_max_time)); + } + + spin_unlock_irq(bfqd->queue->queue_lock); + + return num_char; +} + +#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \ +static ssize_t __FUNC(struct elevator_queue *e, char *page) \ +{ \ + struct bfq_data *bfqd = e->elevator_data; \ + unsigned int __data = __VAR; \ + if (__CONV) \ + __data = jiffies_to_msecs(__data); \ + return bfq_var_show(__data, (page)); \ +} +SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 1); +SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 1); +SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0); +SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0); +SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 1); +SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0); +SHOW_FUNCTION(bfq_max_budget_async_rq_show, + bfqd->bfq_max_budget_async_rq, 0); +SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1); +SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1); +SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0); +SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0); +SHOW_FUNCTION(bfq_wr_rt_max_time_show, bfqd->bfq_wr_rt_max_time, 1); +SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1); +SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async, + 1); +SHOW_FUNCTION(bfq_wr_max_softrt_rate_show, bfqd->bfq_wr_max_softrt_rate, 0); +#undef SHOW_FUNCTION + +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \ +static ssize_t \ +__FUNC(struct elevator_queue *e, const char *page, size_t count) \ +{ \ + struct bfq_data *bfqd = e->elevator_data; \ + unsigned long uninitialized_var(__data); \ + int ret = bfq_var_store(&__data, (page), count); \ + if (__data < (MIN)) \ + __data = (MIN); \ + else if (__data > (MAX)) \ + __data = (MAX); \ + if (__CONV) \ + *(__PTR) = msecs_to_jiffies(__data); \ + else \ + *(__PTR) = __data; \ + return ret; \ +} +STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1, + INT_MAX, 1); +STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1, + INT_MAX, 1); +STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0); +STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1, + INT_MAX, 0); +STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 1); +STORE_FUNCTION(bfq_max_budget_async_rq_store, &bfqd->bfq_max_budget_async_rq, + 1, INT_MAX, 0); +STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0, + INT_MAX, 1); +STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0); +STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1); +STORE_FUNCTION(bfq_wr_rt_max_time_store, &bfqd->bfq_wr_rt_max_time, 0, INT_MAX, + 1); +STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0, + INT_MAX, 1); +STORE_FUNCTION(bfq_wr_min_inter_arr_async_store, + &bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1); +STORE_FUNCTION(bfq_wr_max_softrt_rate_store, &bfqd->bfq_wr_max_softrt_rate, 0, + INT_MAX, 0); +#undef STORE_FUNCTION + +/* do nothing for the moment */ +static ssize_t bfq_weights_store(struct elevator_queue *e, + const char *page, size_t count) +{ + return count; +} + +static unsigned long bfq_estimated_max_budget(struct bfq_data *bfqd) +{ + u64 timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]); + + if (bfqd->peak_rate_samples >= BFQ_PEAK_RATE_SAMPLES) + return bfq_calc_max_budget(bfqd->peak_rate, timeout); + else + return bfq_default_max_budget; +} + +static ssize_t bfq_max_budget_store(struct elevator_queue *e, + const char *page, size_t count) +{ + struct bfq_data *bfqd = e->elevator_data; + unsigned long uninitialized_var(__data); + int ret = bfq_var_store(&__data, (page), count); + + if (__data == 0) + bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd); + else { + if (__data > INT_MAX) + __data = INT_MAX; + bfqd->bfq_max_budget = __data; + } + + bfqd->bfq_user_max_budget = __data; + + return ret; +} + +static ssize_t bfq_timeout_sync_store(struct elevator_queue *e, + const char *page, size_t count) +{ + struct bfq_data *bfqd = e->elevator_data; + unsigned long uninitialized_var(__data); + int ret = bfq_var_store(&__data, (page), count); + + if (__data < 1) + __data = 1; + else if (__data > INT_MAX) + __data = INT_MAX; + + bfqd->bfq_timeout[BLK_RW_SYNC] = msecs_to_jiffies(__data); + if (bfqd->bfq_user_max_budget == 0) + bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd); + + return ret; +} + +static ssize_t bfq_low_latency_store(struct elevator_queue *e, + const char *page, size_t count) +{ + struct bfq_data *bfqd = e->elevator_data; + unsigned long uninitialized_var(__data); + int ret = bfq_var_store(&__data, (page), count); + + if (__data > 1) + __data = 1; + if (__data == 0 && bfqd->low_latency != 0) + bfq_end_wr(bfqd); + bfqd->low_latency = __data; + + return ret; +} + +#define BFQ_ATTR(name) \ + __ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store) + +static struct elv_fs_entry bfq_attrs[] = { + BFQ_ATTR(fifo_expire_sync), + BFQ_ATTR(fifo_expire_async), + BFQ_ATTR(back_seek_max), + BFQ_ATTR(back_seek_penalty), + BFQ_ATTR(slice_idle), + BFQ_ATTR(max_budget), + BFQ_ATTR(max_budget_async_rq), + BFQ_ATTR(timeout_sync), + BFQ_ATTR(timeout_async), + BFQ_ATTR(low_latency), + BFQ_ATTR(wr_coeff), + BFQ_ATTR(wr_max_time), + BFQ_ATTR(wr_rt_max_time), + BFQ_ATTR(wr_min_idle_time), + BFQ_ATTR(wr_min_inter_arr_async), + BFQ_ATTR(wr_max_softrt_rate), + BFQ_ATTR(weights), + __ATTR_NULL +}; + +static struct elevator_type iosched_bfq = { + .ops = { + .elevator_merge_fn = bfq_merge, + .elevator_merged_fn = bfq_merged_request, + .elevator_merge_req_fn = bfq_merged_requests, +#ifdef CONFIG_BFQ_GROUP_IOSCHED + .elevator_bio_merged_fn = bfq_bio_merged, +#endif + .elevator_allow_merge_fn = bfq_allow_merge, + .elevator_dispatch_fn = bfq_dispatch_requests, + .elevator_add_req_fn = bfq_insert_request, + .elevator_activate_req_fn = bfq_activate_request, + .elevator_deactivate_req_fn = bfq_deactivate_request, + .elevator_completed_req_fn = bfq_completed_request, + .elevator_former_req_fn = elv_rb_former_request, + .elevator_latter_req_fn = elv_rb_latter_request, + .elevator_init_icq_fn = bfq_init_icq, + .elevator_exit_icq_fn = bfq_exit_icq, + .elevator_set_req_fn = bfq_set_request, + .elevator_put_req_fn = bfq_put_request, + .elevator_may_queue_fn = bfq_may_queue, + .elevator_init_fn = bfq_init_queue, + .elevator_exit_fn = bfq_exit_queue, + }, + .icq_size = sizeof(struct bfq_io_cq), + .icq_align = __alignof__(struct bfq_io_cq), + .elevator_attrs = bfq_attrs, + .elevator_name = "bfq", + .elevator_owner = THIS_MODULE, +}; + +static int __init bfq_init(void) +{ + int ret; + + /* + * Can be 0 on HZ < 1000 setups. + */ + if (bfq_slice_idle == 0) + bfq_slice_idle = 1; + + if (bfq_timeout_async == 0) + bfq_timeout_async = 1; + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + ret = blkcg_policy_register(&blkcg_policy_bfq); + if (ret) + return ret; +#endif + + ret = -ENOMEM; + if (bfq_slab_setup()) + goto err_pol_unreg; + + /* + * Times to load large popular applications for the typical systems + * installed on the reference devices (see the comments before the + * definitions of the two arrays). + */ + T_slow[0] = msecs_to_jiffies(2600); + T_slow[1] = msecs_to_jiffies(1000); + T_fast[0] = msecs_to_jiffies(5500); + T_fast[1] = msecs_to_jiffies(2000); + + /* + * Thresholds that determine the switch between speed classes (see + * the comments before the definition of the array). + */ + device_speed_thresh[0] = (R_fast[0] + R_slow[0]) / 2; + device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2; + + ret = elv_register(&iosched_bfq); + if (ret) + goto err_pol_unreg; + + pr_info("BFQ I/O-scheduler: v7r11"); + + return 0; + +err_pol_unreg: +#ifdef CONFIG_BFQ_GROUP_IOSCHED + blkcg_policy_unregister(&blkcg_policy_bfq); +#endif + return ret; +} + +static void __exit bfq_exit(void) +{ + elv_unregister(&iosched_bfq); +#ifdef CONFIG_BFQ_GROUP_IOSCHED + blkcg_policy_unregister(&blkcg_policy_bfq); +#endif + bfq_slab_kill(); +} + +module_init(bfq_init); +module_exit(bfq_exit); + +MODULE_AUTHOR("Arianna Avanzini, Fabio Checconi, Paolo Valente"); +MODULE_LICENSE("GPL"); diff --git b/block/bfq-sched.c b/block/bfq-sched.c new file mode 100644 index 0000000..a64fec1 --- /dev/null +++ b/block/bfq-sched.c @@ -0,0 +1,1200 @@ +/* + * BFQ: Hierarchical B-WF2Q+ scheduler. + * + * Based on ideas and code from CFQ: + * Copyright (C) 2003 Jens Axboe + * + * Copyright (C) 2008 Fabio Checconi + * Paolo Valente + * + * Copyright (C) 2010 Paolo Valente + */ + +#ifdef CONFIG_BFQ_GROUP_IOSCHED +#define for_each_entity(entity) \ + for (; entity ; entity = entity->parent) + +#define for_each_entity_safe(entity, parent) \ + for (; entity && ({ parent = entity->parent; 1; }); entity = parent) + + +static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd, + int extract, + struct bfq_data *bfqd); + +static struct bfq_group *bfqq_group(struct bfq_queue *bfqq); + +static void bfq_update_budget(struct bfq_entity *next_in_service) +{ + struct bfq_entity *bfqg_entity; + struct bfq_group *bfqg; + struct bfq_sched_data *group_sd; + + BUG_ON(!next_in_service); + + group_sd = next_in_service->sched_data; + + bfqg = container_of(group_sd, struct bfq_group, sched_data); + /* + * bfq_group's my_entity field is not NULL only if the group + * is not the root group. We must not touch the root entity + * as it must never become an in-service entity. + */ + bfqg_entity = bfqg->my_entity; + if (bfqg_entity) + bfqg_entity->budget = next_in_service->budget; +} + +static int bfq_update_next_in_service(struct bfq_sched_data *sd) +{ + struct bfq_entity *next_in_service; + + if (sd->in_service_entity) + /* will update/requeue at the end of service */ + return 0; + + /* + * NOTE: this can be improved in many ways, such as returning + * 1 (and thus propagating upwards the update) only when the + * budget changes, or caching the bfqq that will be scheduled + * next from this subtree. By now we worry more about + * correctness than about performance... + */ + next_in_service = bfq_lookup_next_entity(sd, 0, NULL); + sd->next_in_service = next_in_service; + + if (next_in_service) + bfq_update_budget(next_in_service); + + return 1; +} + +static void bfq_check_next_in_service(struct bfq_sched_data *sd, + struct bfq_entity *entity) +{ + BUG_ON(sd->next_in_service != entity); +} +#else +#define for_each_entity(entity) \ + for (; entity ; entity = NULL) + +#define for_each_entity_safe(entity, parent) \ + for (parent = NULL; entity ; entity = parent) + +static int bfq_update_next_in_service(struct bfq_sched_data *sd) +{ + return 0; +} + +static void bfq_check_next_in_service(struct bfq_sched_data *sd, + struct bfq_entity *entity) +{ +} + +static void bfq_update_budget(struct bfq_entity *next_in_service) +{ +} +#endif + +/* + * Shift for timestamp calculations. This actually limits the maximum + * service allowed in one timestamp delta (small shift values increase it), + * the maximum total weight that can be used for the queues in the system + * (big shift values increase it), and the period of virtual time + * wraparounds. + */ +#define WFQ_SERVICE_SHIFT 22 + +/** + * bfq_gt - compare two timestamps. + * @a: first ts. + * @b: second ts. + * + * Return @a > @b, dealing with wrapping correctly. + */ +static int bfq_gt(u64 a, u64 b) +{ + return (s64)(a - b) > 0; +} + +static struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = NULL; + + BUG_ON(!entity); + + if (!entity->my_sched_data) + bfqq = container_of(entity, struct bfq_queue, entity); + + return bfqq; +} + + +/** + * bfq_delta - map service into the virtual time domain. + * @service: amount of service. + * @weight: scale factor (weight of an entity or weight sum). + */ +static u64 bfq_delta(unsigned long service, unsigned long weight) +{ + u64 d = (u64)service << WFQ_SERVICE_SHIFT; + + do_div(d, weight); + return d; +} + +/** + * bfq_calc_finish - assign the finish time to an entity. + * @entity: the entity to act upon. + * @service: the service to be charged to the entity. + */ +static void bfq_calc_finish(struct bfq_entity *entity, unsigned long service) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + + BUG_ON(entity->weight == 0); + + entity->finish = entity->start + + bfq_delta(service, entity->weight); + + if (bfqq) { + bfq_log_bfqq(bfqq->bfqd, bfqq, + "calc_finish: serv %lu, w %d", + service, entity->weight); + bfq_log_bfqq(bfqq->bfqd, bfqq, + "calc_finish: start %llu, finish %llu, delta %llu", + entity->start, entity->finish, + bfq_delta(service, entity->weight)); + } +} + +/** + * bfq_entity_of - get an entity from a node. + * @node: the node field of the entity. + * + * Convert a node pointer to the relative entity. This is used only + * to simplify the logic of some functions and not as the generic + * conversion mechanism because, e.g., in the tree walking functions, + * the check for a %NULL value would be redundant. + */ +static struct bfq_entity *bfq_entity_of(struct rb_node *node) +{ + struct bfq_entity *entity = NULL; + + if (node) + entity = rb_entry(node, struct bfq_entity, rb_node); + + return entity; +} + +/** + * bfq_extract - remove an entity from a tree. + * @root: the tree root. + * @entity: the entity to remove. + */ +static void bfq_extract(struct rb_root *root, struct bfq_entity *entity) +{ + BUG_ON(entity->tree != root); + + entity->tree = NULL; + rb_erase(&entity->rb_node, root); +} + +/** + * bfq_idle_extract - extract an entity from the idle tree. + * @st: the service tree of the owning @entity. + * @entity: the entity being removed. + */ +static void bfq_idle_extract(struct bfq_service_tree *st, + struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + struct rb_node *next; + + BUG_ON(entity->tree != &st->idle); + + if (entity == st->first_idle) { + next = rb_next(&entity->rb_node); + st->first_idle = bfq_entity_of(next); + } + + if (entity == st->last_idle) { + next = rb_prev(&entity->rb_node); + st->last_idle = bfq_entity_of(next); + } + + bfq_extract(&st->idle, entity); + + if (bfqq) + list_del(&bfqq->bfqq_list); +} + +/** + * bfq_insert - generic tree insertion. + * @root: tree root. + * @entity: entity to insert. + * + * This is used for the idle and the active tree, since they are both + * ordered by finish time. + */ +static void bfq_insert(struct rb_root *root, struct bfq_entity *entity) +{ + struct bfq_entity *entry; + struct rb_node **node = &root->rb_node; + struct rb_node *parent = NULL; + + BUG_ON(entity->tree); + + while (*node) { + parent = *node; + entry = rb_entry(parent, struct bfq_entity, rb_node); + + if (bfq_gt(entry->finish, entity->finish)) + node = &parent->rb_left; + else + node = &parent->rb_right; + } + + rb_link_node(&entity->rb_node, parent, node); + rb_insert_color(&entity->rb_node, root); + + entity->tree = root; +} + +/** + * bfq_update_min - update the min_start field of a entity. + * @entity: the entity to update. + * @node: one of its children. + * + * This function is called when @entity may store an invalid value for + * min_start due to updates to the active tree. The function assumes + * that the subtree rooted at @node (which may be its left or its right + * child) has a valid min_start value. + */ +static void bfq_update_min(struct bfq_entity *entity, struct rb_node *node) +{ + struct bfq_entity *child; + + if (node) { + child = rb_entry(node, struct bfq_entity, rb_node); + if (bfq_gt(entity->min_start, child->min_start)) + entity->min_start = child->min_start; + } +} + +/** + * bfq_update_active_node - recalculate min_start. + * @node: the node to update. + * + * @node may have changed position or one of its children may have moved, + * this function updates its min_start value. The left and right subtrees + * are assumed to hold a correct min_start value. + */ +static void bfq_update_active_node(struct rb_node *node) +{ + struct bfq_entity *entity = rb_entry(node, struct bfq_entity, rb_node); + + entity->min_start = entity->start; + bfq_update_min(entity, node->rb_right); + bfq_update_min(entity, node->rb_left); +} + +/** + * bfq_update_active_tree - update min_start for the whole active tree. + * @node: the starting node. + * + * @node must be the deepest modified node after an update. This function + * updates its min_start using the values held by its children, assuming + * that they did not change, and then updates all the nodes that may have + * changed in the path to the root. The only nodes that may have changed + * are the ones in the path or their siblings. + */ +static void bfq_update_active_tree(struct rb_node *node) +{ + struct rb_node *parent; + +up: + bfq_update_active_node(node); + + parent = rb_parent(node); + if (!parent) + return; + + if (node == parent->rb_left && parent->rb_right) + bfq_update_active_node(parent->rb_right); + else if (parent->rb_left) + bfq_update_active_node(parent->rb_left); + + node = parent; + goto up; +} + +static void bfq_weights_tree_add(struct bfq_data *bfqd, + struct bfq_entity *entity, + struct rb_root *root); + +static void bfq_weights_tree_remove(struct bfq_data *bfqd, + struct bfq_entity *entity, + struct rb_root *root); + + +/** + * bfq_active_insert - insert an entity in the active tree of its + * group/device. + * @st: the service tree of the entity. + * @entity: the entity being inserted. + * + * The active tree is ordered by finish time, but an extra key is kept + * per each node, containing the minimum value for the start times of + * its children (and the node itself), so it's possible to search for + * the eligible node with the lowest finish time in logarithmic time. + */ +static void bfq_active_insert(struct bfq_service_tree *st, + struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + struct rb_node *node = &entity->rb_node; +#ifdef CONFIG_BFQ_GROUP_IOSCHED + struct bfq_sched_data *sd = NULL; + struct bfq_group *bfqg = NULL; + struct bfq_data *bfqd = NULL; +#endif + + bfq_insert(&st->active, entity); + + if (node->rb_left) + node = node->rb_left; + else if (node->rb_right) + node = node->rb_right; + + bfq_update_active_tree(node); + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + sd = entity->sched_data; + bfqg = container_of(sd, struct bfq_group, sched_data); + BUG_ON(!bfqg); + bfqd = (struct bfq_data *)bfqg->bfqd; +#endif + if (bfqq) + list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list); +#ifdef CONFIG_BFQ_GROUP_IOSCHED + else { /* bfq_group */ + BUG_ON(!bfqd); + bfq_weights_tree_add(bfqd, entity, &bfqd->group_weights_tree); + } + if (bfqg != bfqd->root_group) { + BUG_ON(!bfqg); + BUG_ON(!bfqd); + bfqg->active_entities++; + if (bfqg->active_entities == 2) + bfqd->active_numerous_groups++; + } +#endif +} + +/** + * bfq_ioprio_to_weight - calc a weight from an ioprio. + * @ioprio: the ioprio value to convert. + */ +static unsigned short bfq_ioprio_to_weight(int ioprio) +{ + BUG_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR); + return IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - ioprio; +} + +/** + * bfq_weight_to_ioprio - calc an ioprio from a weight. + * @weight: the weight value to convert. + * + * To preserve as much as possible the old only-ioprio user interface, + * 0 is used as an escape ioprio value for weights (numerically) equal or + * larger than IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF. + */ +static unsigned short bfq_weight_to_ioprio(int weight) +{ + BUG_ON(weight < BFQ_MIN_WEIGHT || weight > BFQ_MAX_WEIGHT); + return IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight < 0 ? + 0 : IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight; +} + +static void bfq_get_entity(struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + + if (bfqq) { + atomic_inc(&bfqq->ref); + bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d", + bfqq, atomic_read(&bfqq->ref)); + } +} + +/** + * bfq_find_deepest - find the deepest node that an extraction can modify. + * @node: the node being removed. + * + * Do the first step of an extraction in an rb tree, looking for the + * node that will replace @node, and returning the deepest node that + * the following modifications to the tree can touch. If @node is the + * last node in the tree return %NULL. + */ +static struct rb_node *bfq_find_deepest(struct rb_node *node) +{ + struct rb_node *deepest; + + if (!node->rb_right && !node->rb_left) + deepest = rb_parent(node); + else if (!node->rb_right) + deepest = node->rb_left; + else if (!node->rb_left) + deepest = node->rb_right; + else { + deepest = rb_next(node); + if (deepest->rb_right) + deepest = deepest->rb_right; + else if (rb_parent(deepest) != node) + deepest = rb_parent(deepest); + } + + return deepest; +} + +/** + * bfq_active_extract - remove an entity from the active tree. + * @st: the service_tree containing the tree. + * @entity: the entity being removed. + */ +static void bfq_active_extract(struct bfq_service_tree *st, + struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + struct rb_node *node; +#ifdef CONFIG_BFQ_GROUP_IOSCHED + struct bfq_sched_data *sd = NULL; + struct bfq_group *bfqg = NULL; + struct bfq_data *bfqd = NULL; +#endif + + node = bfq_find_deepest(&entity->rb_node); + bfq_extract(&st->active, entity); + + if (node) + bfq_update_active_tree(node); + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + sd = entity->sched_data; + bfqg = container_of(sd, struct bfq_group, sched_data); + BUG_ON(!bfqg); + bfqd = (struct bfq_data *)bfqg->bfqd; +#endif + if (bfqq) + list_del(&bfqq->bfqq_list); +#ifdef CONFIG_BFQ_GROUP_IOSCHED + else { /* bfq_group */ + BUG_ON(!bfqd); + bfq_weights_tree_remove(bfqd, entity, + &bfqd->group_weights_tree); + } + if (bfqg != bfqd->root_group) { + BUG_ON(!bfqg); + BUG_ON(!bfqd); + BUG_ON(!bfqg->active_entities); + bfqg->active_entities--; + if (bfqg->active_entities == 1) { + BUG_ON(!bfqd->active_numerous_groups); + bfqd->active_numerous_groups--; + } + } +#endif +} + +/** + * bfq_idle_insert - insert an entity into the idle tree. + * @st: the service tree containing the tree. + * @entity: the entity to insert. + */ +static void bfq_idle_insert(struct bfq_service_tree *st, + struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + struct bfq_entity *first_idle = st->first_idle; + struct bfq_entity *last_idle = st->last_idle; + + if (!first_idle || bfq_gt(first_idle->finish, entity->finish)) + st->first_idle = entity; + if (!last_idle || bfq_gt(entity->finish, last_idle->finish)) + st->last_idle = entity; + + bfq_insert(&st->idle, entity); + + if (bfqq) + list_add(&bfqq->bfqq_list, &bfqq->bfqd->idle_list); +} + +/** + * bfq_forget_entity - remove an entity from the wfq trees. + * @st: the service tree. + * @entity: the entity being removed. + * + * Update the device status and forget everything about @entity, putting + * the device reference to it, if it is a queue. Entities belonging to + * groups are not refcounted. + */ +static void bfq_forget_entity(struct bfq_service_tree *st, + struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + struct bfq_sched_data *sd; + + BUG_ON(!entity->on_st); + + entity->on_st = 0; + st->wsum -= entity->weight; + if (bfqq) { + sd = entity->sched_data; + bfq_log_bfqq(bfqq->bfqd, bfqq, "forget_entity: %p %d", + bfqq, atomic_read(&bfqq->ref)); + bfq_put_queue(bfqq); + } +} + +/** + * bfq_put_idle_entity - release the idle tree ref of an entity. + * @st: service tree for the entity. + * @entity: the entity being released. + */ +static void bfq_put_idle_entity(struct bfq_service_tree *st, + struct bfq_entity *entity) +{ + bfq_idle_extract(st, entity); + bfq_forget_entity(st, entity); +} + +/** + * bfq_forget_idle - update the idle tree if necessary. + * @st: the service tree to act upon. + * + * To preserve the global O(log N) complexity we only remove one entry here; + * as the idle tree will not grow indefinitely this can be done safely. + */ +static void bfq_forget_idle(struct bfq_service_tree *st) +{ + struct bfq_entity *first_idle = st->first_idle; + struct bfq_entity *last_idle = st->last_idle; + + if (RB_EMPTY_ROOT(&st->active) && last_idle && + !bfq_gt(last_idle->finish, st->vtime)) { + /* + * Forget the whole idle tree, increasing the vtime past + * the last finish time of idle entities. + */ + st->vtime = last_idle->finish; + } + + if (first_idle && !bfq_gt(first_idle->finish, st->vtime)) + bfq_put_idle_entity(st, first_idle); +} + +static struct bfq_service_tree * +__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st, + struct bfq_entity *entity) +{ + struct bfq_service_tree *new_st = old_st; + + if (entity->prio_changed) { + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + unsigned short prev_weight, new_weight; + struct bfq_data *bfqd = NULL; + struct rb_root *root; +#ifdef CONFIG_BFQ_GROUP_IOSCHED + struct bfq_sched_data *sd; + struct bfq_group *bfqg; +#endif + + if (bfqq) + bfqd = bfqq->bfqd; +#ifdef CONFIG_BFQ_GROUP_IOSCHED + else { + sd = entity->my_sched_data; + bfqg = container_of(sd, struct bfq_group, sched_data); + BUG_ON(!bfqg); + bfqd = (struct bfq_data *)bfqg->bfqd; + BUG_ON(!bfqd); + } +#endif + + BUG_ON(old_st->wsum < entity->weight); + old_st->wsum -= entity->weight; + + if (entity->new_weight != entity->orig_weight) { + if (entity->new_weight < BFQ_MIN_WEIGHT || + entity->new_weight > BFQ_MAX_WEIGHT) { + printk(KERN_CRIT "update_weight_prio: " + "new_weight %d\n", + entity->new_weight); + BUG(); + } + entity->orig_weight = entity->new_weight; + if (bfqq) + bfqq->ioprio = + bfq_weight_to_ioprio(entity->orig_weight); + } + + if (bfqq) + bfqq->ioprio_class = bfqq->new_ioprio_class; + entity->prio_changed = 0; + + /* + * NOTE: here we may be changing the weight too early, + * this will cause unfairness. The correct approach + * would have required additional complexity to defer + * weight changes to the proper time instants (i.e., + * when entity->finish <= old_st->vtime). + */ + new_st = bfq_entity_service_tree(entity); + + prev_weight = entity->weight; + new_weight = entity->orig_weight * + (bfqq ? bfqq->wr_coeff : 1); + /* + * If the weight of the entity changes, remove the entity + * from its old weight counter (if there is a counter + * associated with the entity), and add it to the counter + * associated with its new weight. + */ + if (prev_weight != new_weight) { + root = bfqq ? &bfqd->queue_weights_tree : + &bfqd->group_weights_tree; + bfq_weights_tree_remove(bfqd, entity, root); + } + entity->weight = new_weight; + /* + * Add the entity to its weights tree only if it is + * not associated with a weight-raised queue. + */ + if (prev_weight != new_weight && + (bfqq ? bfqq->wr_coeff == 1 : 1)) + /* If we get here, root has been initialized. */ + bfq_weights_tree_add(bfqd, entity, root); + + new_st->wsum += entity->weight; + + if (new_st != old_st) + entity->start = new_st->vtime; + } + + return new_st; +} + +#ifdef CONFIG_BFQ_GROUP_IOSCHED +static void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg); +#endif + +/** + * bfq_bfqq_served - update the scheduler status after selection for + * service. + * @bfqq: the queue being served. + * @served: bytes to transfer. + * + * NOTE: this can be optimized, as the timestamps of upper level entities + * are synchronized every time a new bfqq is selected for service. By now, + * we keep it to better check consistency. + */ +static void bfq_bfqq_served(struct bfq_queue *bfqq, int served) +{ + struct bfq_entity *entity = &bfqq->entity; + struct bfq_service_tree *st; + + for_each_entity(entity) { + st = bfq_entity_service_tree(entity); + + entity->service += served; + BUG_ON(entity->service > entity->budget); + BUG_ON(st->wsum == 0); + + st->vtime += bfq_delta(served, st->wsum); + bfq_forget_idle(st); + } +#ifdef CONFIG_BFQ_GROUP_IOSCHED + bfqg_stats_set_start_empty_time(bfqq_group(bfqq)); +#endif + bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %d secs", served); +} + +/** + * bfq_bfqq_charge_full_budget - set the service to the entity budget. + * @bfqq: the queue that needs a service update. + * + * When it's not possible to be fair in the service domain, because + * a queue is not consuming its budget fast enough (the meaning of + * fast depends on the timeout parameter), we charge it a full + * budget. In this way we should obtain a sort of time-domain + * fairness among all the seeky/slow queues. + */ +static void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq) +{ + struct bfq_entity *entity = &bfqq->entity; + + bfq_log_bfqq(bfqq->bfqd, bfqq, "charge_full_budget"); + + bfq_bfqq_served(bfqq, entity->budget - entity->service); +} + +/** + * __bfq_activate_entity - activate an entity. + * @entity: the entity being activated. + * + * Called whenever an entity is activated, i.e., it is not active and one + * of its children receives a new request, or has to be reactivated due to + * budget exhaustion. It uses the current budget of the entity (and the + * service received if @entity is active) of the queue to calculate its + * timestamps. + */ +static void __bfq_activate_entity(struct bfq_entity *entity) +{ + struct bfq_sched_data *sd = entity->sched_data; + struct bfq_service_tree *st = bfq_entity_service_tree(entity); + + if (entity == sd->in_service_entity) { + BUG_ON(entity->tree); + /* + * If we are requeueing the current entity we have + * to take care of not charging to it service it has + * not received. + */ + bfq_calc_finish(entity, entity->service); + entity->start = entity->finish; + sd->in_service_entity = NULL; + } else if (entity->tree == &st->active) { + /* + * Requeueing an entity due to a change of some + * next_in_service entity below it. We reuse the + * old start time. + */ + bfq_active_extract(st, entity); + } else if (entity->tree == &st->idle) { + /* + * Must be on the idle tree, bfq_idle_extract() will + * check for that. + */ + bfq_idle_extract(st, entity); + entity->start = bfq_gt(st->vtime, entity->finish) ? + st->vtime : entity->finish; + } else { + /* + * The finish time of the entity may be invalid, and + * it is in the past for sure, otherwise the queue + * would have been on the idle tree. + */ + entity->start = st->vtime; + st->wsum += entity->weight; + bfq_get_entity(entity); + + BUG_ON(entity->on_st); + entity->on_st = 1; + } + + st = __bfq_entity_update_weight_prio(st, entity); + bfq_calc_finish(entity, entity->budget); + bfq_active_insert(st, entity); +} + +/** + * bfq_activate_entity - activate an entity and its ancestors if necessary. + * @entity: the entity to activate. + * + * Activate @entity and all the entities on the path from it to the root. + */ +static void bfq_activate_entity(struct bfq_entity *entity) +{ + struct bfq_sched_data *sd; + + for_each_entity(entity) { + __bfq_activate_entity(entity); + + sd = entity->sched_data; + if (!bfq_update_next_in_service(sd)) + /* + * No need to propagate the activation to the + * upper entities, as they will be updated when + * the in-service entity is rescheduled. + */ + break; + } +} + +/** + * __bfq_deactivate_entity - deactivate an entity from its service tree. + * @entity: the entity to deactivate. + * @requeue: if false, the entity will not be put into the idle tree. + * + * Deactivate an entity, independently from its previous state. If the + * entity was not on a service tree just return, otherwise if it is on + * any scheduler tree, extract it from that tree, and if necessary + * and if the caller did not specify @requeue, put it on the idle tree. + * + * Return %1 if the caller should update the entity hierarchy, i.e., + * if the entity was in service or if it was the next_in_service for + * its sched_data; return %0 otherwise. + */ +static int __bfq_deactivate_entity(struct bfq_entity *entity, int requeue) +{ + struct bfq_sched_data *sd = entity->sched_data; + struct bfq_service_tree *st; + int was_in_service; + int ret = 0; + + if (sd == NULL || !entity->on_st) /* never activated, or inactive */ + return 0; + + st = bfq_entity_service_tree(entity); + was_in_service = entity == sd->in_service_entity; + + BUG_ON(was_in_service && entity->tree); + + if (was_in_service) { + bfq_calc_finish(entity, entity->service); + sd->in_service_entity = NULL; + } else if (entity->tree == &st->active) + bfq_active_extract(st, entity); + else if (entity->tree == &st->idle) + bfq_idle_extract(st, entity); + else if (entity->tree) + BUG(); + + if (was_in_service || sd->next_in_service == entity) + ret = bfq_update_next_in_service(sd); + + if (!requeue || !bfq_gt(entity->finish, st->vtime)) + bfq_forget_entity(st, entity); + else + bfq_idle_insert(st, entity); + + BUG_ON(sd->in_service_entity == entity); + BUG_ON(sd->next_in_service == entity); + + return ret; +} + +/** + * bfq_deactivate_entity - deactivate an entity. + * @entity: the entity to deactivate. + * @requeue: true if the entity can be put on the idle tree + */ +static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue) +{ + struct bfq_sched_data *sd; + struct bfq_entity *parent; + + for_each_entity_safe(entity, parent) { + sd = entity->sched_data; + + if (!__bfq_deactivate_entity(entity, requeue)) + /* + * The parent entity is still backlogged, and + * we don't need to update it as it is still + * in service. + */ + break; + + if (sd->next_in_service) + /* + * The parent entity is still backlogged and + * the budgets on the path towards the root + * need to be updated. + */ + goto update; + + /* + * If we reach there the parent is no more backlogged and + * we want to propagate the dequeue upwards. + */ + requeue = 1; + } + + return; + +update: + entity = parent; + for_each_entity(entity) { + __bfq_activate_entity(entity); + + sd = entity->sched_data; + if (!bfq_update_next_in_service(sd)) + break; + } +} + +/** + * bfq_update_vtime - update vtime if necessary. + * @st: the service tree to act upon. + * + * If necessary update the service tree vtime to have at least one + * eligible entity, skipping to its start time. Assumes that the + * active tree of the device is not empty. + * + * NOTE: this hierarchical implementation updates vtimes quite often, + * we may end up with reactivated processes getting timestamps after a + * vtime skip done because we needed a ->first_active entity on some + * intermediate node. + */ +static void bfq_update_vtime(struct bfq_service_tree *st) +{ + struct bfq_entity *entry; + struct rb_node *node = st->active.rb_node; + + entry = rb_entry(node, struct bfq_entity, rb_node); + if (bfq_gt(entry->min_start, st->vtime)) { + st->vtime = entry->min_start; + bfq_forget_idle(st); + } +} + +/** + * bfq_first_active_entity - find the eligible entity with + * the smallest finish time + * @st: the service tree to select from. + * + * This function searches the first schedulable entity, starting from the + * root of the tree and going on the left every time on this side there is + * a subtree with at least one eligible (start >= vtime) entity. The path on + * the right is followed only if a) the left subtree contains no eligible + * entities and b) no eligible entity has been found yet. + */ +static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st) +{ + struct bfq_entity *entry, *first = NULL; + struct rb_node *node = st->active.rb_node; + + while (node) { + entry = rb_entry(node, struct bfq_entity, rb_node); +left: + if (!bfq_gt(entry->start, st->vtime)) + first = entry; + + BUG_ON(bfq_gt(entry->min_start, st->vtime)); + + if (node->rb_left) { + entry = rb_entry(node->rb_left, + struct bfq_entity, rb_node); + if (!bfq_gt(entry->min_start, st->vtime)) { + node = node->rb_left; + goto left; + } + } + if (first) + break; + node = node->rb_right; + } + + BUG_ON(!first && !RB_EMPTY_ROOT(&st->active)); + return first; +} + +/** + * __bfq_lookup_next_entity - return the first eligible entity in @st. + * @st: the service tree. + * + * Update the virtual time in @st and return the first eligible entity + * it contains. + */ +static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st, + bool force) +{ + struct bfq_entity *entity, *new_next_in_service = NULL; + + if (RB_EMPTY_ROOT(&st->active)) + return NULL; + + bfq_update_vtime(st); + entity = bfq_first_active_entity(st); + BUG_ON(bfq_gt(entity->start, st->vtime)); + + /* + * If the chosen entity does not match with the sched_data's + * next_in_service and we are forcedly serving the IDLE priority + * class tree, bubble up budget update. + */ + if (unlikely(force && entity != entity->sched_data->next_in_service)) { + new_next_in_service = entity; + for_each_entity(new_next_in_service) + bfq_update_budget(new_next_in_service); + } + + return entity; +} + +/** + * bfq_lookup_next_entity - return the first eligible entity in @sd. + * @sd: the sched_data. + * @extract: if true the returned entity will be also extracted from @sd. + * + * NOTE: since we cache the next_in_service entity at each level of the + * hierarchy, the complexity of the lookup can be decreased with + * absolutely no effort just returning the cached next_in_service value; + * we prefer to do full lookups to test the consistency of * the data + * structures. + */ +static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd, + int extract, + struct bfq_data *bfqd) +{ + struct bfq_service_tree *st = sd->service_tree; + struct bfq_entity *entity; + int i = 0; + + BUG_ON(sd->in_service_entity); + + if (bfqd && + jiffies - bfqd->bfq_class_idle_last_service > BFQ_CL_IDLE_TIMEOUT) { + entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1, + true); + if (entity) { + i = BFQ_IOPRIO_CLASSES - 1; + bfqd->bfq_class_idle_last_service = jiffies; + sd->next_in_service = entity; + } + } + for (; i < BFQ_IOPRIO_CLASSES; i++) { + entity = __bfq_lookup_next_entity(st + i, false); + if (entity) { + if (extract) { + bfq_check_next_in_service(sd, entity); + bfq_active_extract(st + i, entity); + sd->in_service_entity = entity; + sd->next_in_service = NULL; + } + break; + } + } + + return entity; +} + +/* + * Get next queue for service. + */ +static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd) +{ + struct bfq_entity *entity = NULL; + struct bfq_sched_data *sd; + struct bfq_queue *bfqq; + + BUG_ON(bfqd->in_service_queue); + + if (bfqd->busy_queues == 0) + return NULL; + + sd = &bfqd->root_group->sched_data; + for (; sd ; sd = entity->my_sched_data) { + entity = bfq_lookup_next_entity(sd, 1, bfqd); + BUG_ON(!entity); + entity->service = 0; + } + + bfqq = bfq_entity_to_bfqq(entity); + BUG_ON(!bfqq); + + return bfqq; +} + +static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd) +{ + if (bfqd->in_service_bic) { + put_io_context(bfqd->in_service_bic->icq.ioc); + bfqd->in_service_bic = NULL; + } + + bfqd->in_service_queue = NULL; + del_timer(&bfqd->idle_slice_timer); +} + +static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq, + int requeue) +{ + struct bfq_entity *entity = &bfqq->entity; + + if (bfqq == bfqd->in_service_queue) + __bfq_bfqd_reset_in_service(bfqd); + + bfq_deactivate_entity(entity, requeue); +} + +static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + struct bfq_entity *entity = &bfqq->entity; + + bfq_activate_entity(entity); +} + +#ifdef CONFIG_BFQ_GROUP_IOSCHED +static void bfqg_stats_update_dequeue(struct bfq_group *bfqg); +#endif + +/* + * Called when the bfqq no longer has requests pending, remove it from + * the service tree. + */ +static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq, + int requeue) +{ + BUG_ON(!bfq_bfqq_busy(bfqq)); + BUG_ON(!RB_EMPTY_ROOT(&bfqq->sort_list)); + + bfq_log_bfqq(bfqd, bfqq, "del from busy"); + + bfq_clear_bfqq_busy(bfqq); + + BUG_ON(bfqd->busy_queues == 0); + bfqd->busy_queues--; + + if (!bfqq->dispatched) { + bfq_weights_tree_remove(bfqd, &bfqq->entity, + &bfqd->queue_weights_tree); + if (!blk_queue_nonrot(bfqd->queue)) { + BUG_ON(!bfqd->busy_in_flight_queues); + bfqd->busy_in_flight_queues--; + if (bfq_bfqq_constantly_seeky(bfqq)) { + BUG_ON(!bfqd-> + const_seeky_busy_in_flight_queues); + bfqd->const_seeky_busy_in_flight_queues--; + } + } + } + if (bfqq->wr_coeff > 1) + bfqd->wr_busy_queues--; + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + bfqg_stats_update_dequeue(bfqq_group(bfqq)); +#endif + + bfq_deactivate_bfqq(bfqd, bfqq, requeue); +} + +/* + * Called when an inactive queue receives a new request. + */ +static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + BUG_ON(bfq_bfqq_busy(bfqq)); + BUG_ON(bfqq == bfqd->in_service_queue); + + bfq_log_bfqq(bfqd, bfqq, "add to busy"); + + bfq_activate_bfqq(bfqd, bfqq); + + bfq_mark_bfqq_busy(bfqq); + bfqd->busy_queues++; + + if (!bfqq->dispatched) { + if (bfqq->wr_coeff == 1) + bfq_weights_tree_add(bfqd, &bfqq->entity, + &bfqd->queue_weights_tree); + if (!blk_queue_nonrot(bfqd->queue)) { + bfqd->busy_in_flight_queues++; + if (bfq_bfqq_constantly_seeky(bfqq)) + bfqd->const_seeky_busy_in_flight_queues++; + } + } + if (bfqq->wr_coeff > 1) + bfqd->wr_busy_queues++; +} diff --git b/block/bfq.h b/block/bfq.h new file mode 100644 index 0000000..32dfcee --- /dev/null +++ b/block/bfq.h @@ -0,0 +1,867 @@ +/* + * BFQ-v7r11 for 4.4.0: data structures and common functions prototypes. + * + * Based on ideas and code from CFQ: + * Copyright (C) 2003 Jens Axboe + * + * Copyright (C) 2008 Fabio Checconi + * Paolo Valente + * + * Copyright (C) 2010 Paolo Valente + */ + +#ifndef _BFQ_H +#define _BFQ_H + +#include +#include +#include +#include +#include + +#define BFQ_IOPRIO_CLASSES 3 +#define BFQ_CL_IDLE_TIMEOUT (HZ/5) + +#define BFQ_MIN_WEIGHT 1 +#define BFQ_MAX_WEIGHT 1000 +#define BFQ_WEIGHT_CONVERSION_COEFF 10 + +#define BFQ_DEFAULT_QUEUE_IOPRIO 4 + +#define BFQ_DEFAULT_GRP_WEIGHT 10 +#define BFQ_DEFAULT_GRP_IOPRIO 0 +#define BFQ_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE + +struct bfq_entity; + +/** + * struct bfq_service_tree - per ioprio_class service tree. + * @active: tree for active entities (i.e., those backlogged). + * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i). + * @first_idle: idle entity with minimum F_i. + * @last_idle: idle entity with maximum F_i. + * @vtime: scheduler virtual time. + * @wsum: scheduler weight sum; active and idle entities contribute to it. + * + * Each service tree represents a B-WF2Q+ scheduler on its own. Each + * ioprio_class has its own independent scheduler, and so its own + * bfq_service_tree. All the fields are protected by the queue lock + * of the containing bfqd. + */ +struct bfq_service_tree { + struct rb_root active; + struct rb_root idle; + + struct bfq_entity *first_idle; + struct bfq_entity *last_idle; + + u64 vtime; + unsigned long wsum; +}; + +/** + * struct bfq_sched_data - multi-class scheduler. + * @in_service_entity: entity in service. + * @next_in_service: head-of-the-line entity in the scheduler. + * @service_tree: array of service trees, one per ioprio_class. + * + * bfq_sched_data is the basic scheduler queue. It supports three + * ioprio_classes, and can be used either as a toplevel queue or as + * an intermediate queue on a hierarchical setup. + * @next_in_service points to the active entity of the sched_data + * service trees that will be scheduled next. + * + * The supported ioprio_classes are the same as in CFQ, in descending + * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE. + * Requests from higher priority queues are served before all the + * requests from lower priority queues; among requests of the same + * queue requests are served according to B-WF2Q+. + * All the fields are protected by the queue lock of the containing bfqd. + */ +struct bfq_sched_data { + struct bfq_entity *in_service_entity; + struct bfq_entity *next_in_service; + struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES]; +}; + +/** + * struct bfq_weight_counter - counter of the number of all active entities + * with a given weight. + * @weight: weight of the entities that this counter refers to. + * @num_active: number of active entities with this weight. + * @weights_node: weights tree member (see bfq_data's @queue_weights_tree + * and @group_weights_tree). + */ +struct bfq_weight_counter { + short int weight; + unsigned int num_active; + struct rb_node weights_node; +}; + +/** + * struct bfq_entity - schedulable entity. + * @rb_node: service_tree member. + * @weight_counter: pointer to the weight counter associated with this entity. + * @on_st: flag, true if the entity is on a tree (either the active or + * the idle one of its service_tree). + * @finish: B-WF2Q+ finish timestamp (aka F_i). + * @start: B-WF2Q+ start timestamp (aka S_i). + * @tree: tree the entity is enqueued into; %NULL if not on a tree. + * @min_start: minimum start time of the (active) subtree rooted at + * this entity; used for O(log N) lookups into active trees. + * @service: service received during the last round of service. + * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight. + * @weight: weight of the queue + * @parent: parent entity, for hierarchical scheduling. + * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the + * associated scheduler queue, %NULL on leaf nodes. + * @sched_data: the scheduler queue this entity belongs to. + * @ioprio: the ioprio in use. + * @new_weight: when a weight change is requested, the new weight value. + * @orig_weight: original weight, used to implement weight boosting + * @prio_changed: flag, true when the user requested a weight, ioprio or + * ioprio_class change. + * + * A bfq_entity is used to represent either a bfq_queue (leaf node in the + * cgroup hierarchy) or a bfq_group into the upper level scheduler. Each + * entity belongs to the sched_data of the parent group in the cgroup + * hierarchy. Non-leaf entities have also their own sched_data, stored + * in @my_sched_data. + * + * Each entity stores independently its priority values; this would + * allow different weights on different devices, but this + * functionality is not exported to userspace by now. Priorities and + * weights are updated lazily, first storing the new values into the + * new_* fields, then setting the @prio_changed flag. As soon as + * there is a transition in the entity state that allows the priority + * update to take place the effective and the requested priority + * values are synchronized. + * + * Unless cgroups are used, the weight value is calculated from the + * ioprio to export the same interface as CFQ. When dealing with + * ``well-behaved'' queues (i.e., queues that do not spend too much + * time to consume their budget and have true sequential behavior, and + * when there are no external factors breaking anticipation) the + * relative weights at each level of the cgroups hierarchy should be + * guaranteed. All the fields are protected by the queue lock of the + * containing bfqd. + */ +struct bfq_entity { + struct rb_node rb_node; + struct bfq_weight_counter *weight_counter; + + int on_st; + + u64 finish; + u64 start; + + struct rb_root *tree; + + u64 min_start; + + int service, budget; + unsigned short weight, new_weight; + unsigned short orig_weight; + + struct bfq_entity *parent; + + struct bfq_sched_data *my_sched_data; + struct bfq_sched_data *sched_data; + + int prio_changed; +}; + +struct bfq_group; + +/** + * struct bfq_queue - leaf schedulable entity. + * @ref: reference counter. + * @bfqd: parent bfq_data. + * @new_ioprio: when an ioprio change is requested, the new ioprio value. + * @ioprio_class: the ioprio_class in use. + * @new_ioprio_class: when an ioprio_class change is requested, the new + * ioprio_class value. + * @new_bfqq: shared bfq_queue if queue is cooperating with + * one or more other queues. + * @pos_node: request-position tree member (see bfq_group's @rq_pos_tree). + * @pos_root: request-position tree root (see bfq_group's @rq_pos_tree). + * @sort_list: sorted list of pending requests. + * @next_rq: if fifo isn't expired, next request to serve. + * @queued: nr of requests queued in @sort_list. + * @allocated: currently allocated requests. + * @meta_pending: pending metadata requests. + * @fifo: fifo list of requests in sort_list. + * @entity: entity representing this queue in the scheduler. + * @max_budget: maximum budget allowed from the feedback mechanism. + * @budget_timeout: budget expiration (in jiffies). + * @dispatched: number of requests on the dispatch list or inside driver. + * @flags: status flags. + * @bfqq_list: node for active/idle bfqq list inside our bfqd. + * @burst_list_node: node for the device's burst list. + * @seek_samples: number of seeks sampled + * @seek_total: sum of the distances of the seeks sampled + * @seek_mean: mean seek distance + * @last_request_pos: position of the last request enqueued + * @requests_within_timer: number of consecutive pairs of request completion + * and arrival, such that the queue becomes idle + * after the completion, but the next request arrives + * within an idle time slice; used only if the queue's + * IO_bound has been cleared. + * @pid: pid of the process owning the queue, used for logging purposes. + * @last_wr_start_finish: start time of the current weight-raising period if + * the @bfq-queue is being weight-raised, otherwise + * finish time of the last weight-raising period + * @wr_cur_max_time: current max raising time for this queue + * @soft_rt_next_start: minimum time instant such that, only if a new + * request is enqueued after this time instant in an + * idle @bfq_queue with no outstanding requests, then + * the task associated with the queue it is deemed as + * soft real-time (see the comments to the function + * bfq_bfqq_softrt_next_start()) + * @last_idle_bklogged: time of the last transition of the @bfq_queue from + * idle to backlogged + * @service_from_backlogged: cumulative service received from the @bfq_queue + * since the last transition from idle to + * backlogged + * @bic: pointer to the bfq_io_cq owning the bfq_queue, set to %NULL if the + * queue is shared + * + * A bfq_queue is a leaf request queue; it can be associated with an + * io_context or more, if it is async or shared between cooperating + * processes. @cgroup holds a reference to the cgroup, to be sure that it + * does not disappear while a bfqq still references it (mostly to avoid + * races between request issuing and task migration followed by cgroup + * destruction). + * All the fields are protected by the queue lock of the containing bfqd. + */ +struct bfq_queue { + atomic_t ref; + struct bfq_data *bfqd; + + unsigned short ioprio, new_ioprio; + unsigned short ioprio_class, new_ioprio_class; + + /* fields for cooperating queues handling */ + struct bfq_queue *new_bfqq; + struct rb_node pos_node; + struct rb_root *pos_root; + + struct rb_root sort_list; + struct request *next_rq; + int queued[2]; + int allocated[2]; + int meta_pending; + struct list_head fifo; + + struct bfq_entity entity; + + int max_budget; + unsigned long budget_timeout; + + int dispatched; + + unsigned int flags; + + struct list_head bfqq_list; + + struct hlist_node burst_list_node; + + unsigned int seek_samples; + u64 seek_total; + sector_t seek_mean; + sector_t last_request_pos; + + unsigned int requests_within_timer; + + pid_t pid; + struct bfq_io_cq *bic; + + /* weight-raising fields */ + unsigned long wr_cur_max_time; + unsigned long soft_rt_next_start; + unsigned long last_wr_start_finish; + unsigned int wr_coeff; + unsigned long last_idle_bklogged; + unsigned long service_from_backlogged; +}; + +/** + * struct bfq_ttime - per process thinktime stats. + * @ttime_total: total process thinktime + * @ttime_samples: number of thinktime samples + * @ttime_mean: average process thinktime + */ +struct bfq_ttime { + unsigned long last_end_request; + + unsigned long ttime_total; + unsigned long ttime_samples; + unsigned long ttime_mean; +}; + +/** + * struct bfq_io_cq - per (request_queue, io_context) structure. + * @icq: associated io_cq structure + * @bfqq: array of two process queues, the sync and the async + * @ttime: associated @bfq_ttime struct + * @ioprio: per (request_queue, blkcg) ioprio. + * @blkcg_id: id of the blkcg the related io_cq belongs to. + * @wr_time_left: snapshot of the time left before weight raising ends + * for the sync queue associated to this process; this + * snapshot is taken to remember this value while the weight + * raising is suspended because the queue is merged with a + * shared queue, and is used to set @raising_cur_max_time + * when the queue is split from the shared queue and its + * weight is raised again + * @saved_idle_window: same purpose as the previous field for the idle + * window + * @saved_IO_bound: same purpose as the previous two fields for the I/O + * bound classification of a queue + * @saved_in_large_burst: same purpose as the previous fields for the + * value of the field keeping the queue's belonging + * to a large burst + * @was_in_burst_list: true if the queue belonged to a burst list + * before its merge with another cooperating queue + * @cooperations: counter of consecutive successful queue merges underwent + * by any of the process' @bfq_queues + * @failed_cooperations: counter of consecutive failed queue merges of any + * of the process' @bfq_queues + */ +struct bfq_io_cq { + struct io_cq icq; /* must be the first member */ + struct bfq_queue *bfqq[2]; + struct bfq_ttime ttime; + int ioprio; + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + uint64_t blkcg_id; /* the current blkcg ID */ +#endif + + unsigned int wr_time_left; + bool saved_idle_window; + bool saved_IO_bound; + + bool saved_in_large_burst; + bool was_in_burst_list; + + unsigned int cooperations; + unsigned int failed_cooperations; +}; + +enum bfq_device_speed { + BFQ_BFQD_FAST, + BFQ_BFQD_SLOW, +}; + +/** + * struct bfq_data - per device data structure. + * @queue: request queue for the managed device. + * @root_group: root bfq_group for the device. + * @active_numerous_groups: number of bfq_groups containing more than one + * active @bfq_entity. + * @queue_weights_tree: rbtree of weight counters of @bfq_queues, sorted by + * weight. Used to keep track of whether all @bfq_queues + * have the same weight. The tree contains one counter + * for each distinct weight associated to some active + * and not weight-raised @bfq_queue (see the comments to + * the functions bfq_weights_tree_[add|remove] for + * further details). + * @group_weights_tree: rbtree of non-queue @bfq_entity weight counters, sorted + * by weight. Used to keep track of whether all + * @bfq_groups have the same weight. The tree contains + * one counter for each distinct weight associated to + * some active @bfq_group (see the comments to the + * functions bfq_weights_tree_[add|remove] for further + * details). + * @busy_queues: number of bfq_queues containing requests (including the + * queue in service, even if it is idling). + * @busy_in_flight_queues: number of @bfq_queues containing pending or + * in-flight requests, plus the @bfq_queue in + * service, even if idle but waiting for the + * possible arrival of its next sync request. This + * field is updated only if the device is rotational, + * but used only if the device is also NCQ-capable. + * The reason why the field is updated also for non- + * NCQ-capable rotational devices is related to the + * fact that the value of @hw_tag may be set also + * later than when busy_in_flight_queues may need to + * be incremented for the first time(s). Taking also + * this possibility into account, to avoid unbalanced + * increments/decrements, would imply more overhead + * than just updating busy_in_flight_queues + * regardless of the value of @hw_tag. + * @const_seeky_busy_in_flight_queues: number of constantly-seeky @bfq_queues + * (that is, seeky queues that expired + * for budget timeout at least once) + * containing pending or in-flight + * requests, including the in-service + * @bfq_queue if constantly seeky. This + * field is updated only if the device + * is rotational, but used only if the + * device is also NCQ-capable (see the + * comments to @busy_in_flight_queues). + * @wr_busy_queues: number of weight-raised busy @bfq_queues. + * @queued: number of queued requests. + * @rq_in_driver: number of requests dispatched and waiting for completion. + * @sync_flight: number of sync requests in the driver. + * @max_rq_in_driver: max number of reqs in driver in the last + * @hw_tag_samples completed requests. + * @hw_tag_samples: nr of samples used to calculate hw_tag. + * @hw_tag: flag set to one if the driver is showing a queueing behavior. + * @budgets_assigned: number of budgets assigned. + * @idle_slice_timer: timer set when idling for the next sequential request + * from the queue in service. + * @unplug_work: delayed work to restart dispatching on the request queue. + * @in_service_queue: bfq_queue in service. + * @in_service_bic: bfq_io_cq (bic) associated with the @in_service_queue. + * @last_position: on-disk position of the last served request. + * @last_budget_start: beginning of the last budget. + * @last_idling_start: beginning of the last idle slice. + * @peak_rate: peak transfer rate observed for a budget. + * @peak_rate_samples: number of samples used to calculate @peak_rate. + * @bfq_max_budget: maximum budget allotted to a bfq_queue before + * rescheduling. + * @active_list: list of all the bfq_queues active on the device. + * @idle_list: list of all the bfq_queues idle on the device. + * @bfq_fifo_expire: timeout for async/sync requests; when it expires + * requests are served in fifo order. + * @bfq_back_penalty: weight of backward seeks wrt forward ones. + * @bfq_back_max: maximum allowed backward seek. + * @bfq_slice_idle: maximum idling time. + * @bfq_user_max_budget: user-configured max budget value + * (0 for auto-tuning). + * @bfq_max_budget_async_rq: maximum budget (in nr of requests) allotted to + * async queues. + * @bfq_timeout: timeout for bfq_queues to consume their budget; used to + * to prevent seeky queues to impose long latencies to well + * behaved ones (this also implies that seeky queues cannot + * receive guarantees in the service domain; after a timeout + * they are charged for the whole allocated budget, to try + * to preserve a behavior reasonably fair among them, but + * without service-domain guarantees). + * @bfq_coop_thresh: number of queue merges after which a @bfq_queue is + * no more granted any weight-raising. + * @bfq_failed_cooperations: number of consecutive failed cooperation + * chances after which weight-raising is restored + * to a queue subject to more than bfq_coop_thresh + * queue merges. + * @bfq_requests_within_timer: number of consecutive requests that must be + * issued within the idle time slice to set + * again idling to a queue which was marked as + * non-I/O-bound (see the definition of the + * IO_bound flag for further details). + * @last_ins_in_burst: last time at which a queue entered the current + * burst of queues being activated shortly after + * each other; for more details about this and the + * following parameters related to a burst of + * activations, see the comments to the function + * @bfq_handle_burst. + * @bfq_burst_interval: reference time interval used to decide whether a + * queue has been activated shortly after + * @last_ins_in_burst. + * @burst_size: number of queues in the current burst of queue activations. + * @bfq_large_burst_thresh: maximum burst size above which the current + * queue-activation burst is deemed as 'large'. + * @large_burst: true if a large queue-activation burst is in progress. + * @burst_list: head of the burst list (as for the above fields, more details + * in the comments to the function bfq_handle_burst). + * @low_latency: if set to true, low-latency heuristics are enabled. + * @bfq_wr_coeff: maximum factor by which the weight of a weight-raised + * queue is multiplied. + * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies). + * @bfq_wr_rt_max_time: maximum duration for soft real-time processes. + * @bfq_wr_min_idle_time: minimum idle period after which weight-raising + * may be reactivated for a queue (in jiffies). + * @bfq_wr_min_inter_arr_async: minimum period between request arrivals + * after which weight-raising may be + * reactivated for an already busy queue + * (in jiffies). + * @bfq_wr_max_softrt_rate: max service-rate for a soft real-time queue, + * sectors per seconds. + * @RT_prod: cached value of the product R*T used for computing the maximum + * duration of the weight raising automatically. + * @device_speed: device-speed class for the low-latency heuristic. + * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions. + * + * All the fields are protected by the @queue lock. + */ +struct bfq_data { + struct request_queue *queue; + + struct bfq_group *root_group; + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + int active_numerous_groups; +#endif + + struct rb_root queue_weights_tree; + struct rb_root group_weights_tree; + + int busy_queues; + int busy_in_flight_queues; + int const_seeky_busy_in_flight_queues; + int wr_busy_queues; + int queued; + int rq_in_driver; + int sync_flight; + + int max_rq_in_driver; + int hw_tag_samples; + int hw_tag; + + int budgets_assigned; + + struct timer_list idle_slice_timer; + struct work_struct unplug_work; + + struct bfq_queue *in_service_queue; + struct bfq_io_cq *in_service_bic; + + sector_t last_position; + + ktime_t last_budget_start; + ktime_t last_idling_start; + int peak_rate_samples; + u64 peak_rate; + int bfq_max_budget; + + struct list_head active_list; + struct list_head idle_list; + + unsigned int bfq_fifo_expire[2]; + unsigned int bfq_back_penalty; + unsigned int bfq_back_max; + unsigned int bfq_slice_idle; + u64 bfq_class_idle_last_service; + + int bfq_user_max_budget; + int bfq_max_budget_async_rq; + unsigned int bfq_timeout[2]; + + unsigned int bfq_coop_thresh; + unsigned int bfq_failed_cooperations; + unsigned int bfq_requests_within_timer; + + unsigned long last_ins_in_burst; + unsigned long bfq_burst_interval; + int burst_size; + unsigned long bfq_large_burst_thresh; + bool large_burst; + struct hlist_head burst_list; + + bool low_latency; + + /* parameters of the low_latency heuristics */ + unsigned int bfq_wr_coeff; + unsigned int bfq_wr_max_time; + unsigned int bfq_wr_rt_max_time; + unsigned int bfq_wr_min_idle_time; + unsigned long bfq_wr_min_inter_arr_async; + unsigned int bfq_wr_max_softrt_rate; + u64 RT_prod; + enum bfq_device_speed device_speed; + + struct bfq_queue oom_bfqq; +}; + +enum bfqq_state_flags { + BFQ_BFQQ_FLAG_busy = 0, /* has requests or is in service */ + BFQ_BFQQ_FLAG_wait_request, /* waiting for a request */ + BFQ_BFQQ_FLAG_must_alloc, /* must be allowed rq alloc */ + BFQ_BFQQ_FLAG_fifo_expire, /* FIFO checked in this slice */ + BFQ_BFQQ_FLAG_idle_window, /* slice idling enabled */ + BFQ_BFQQ_FLAG_sync, /* synchronous queue */ + BFQ_BFQQ_FLAG_budget_new, /* no completion with this budget */ + BFQ_BFQQ_FLAG_IO_bound, /* + * bfqq has timed-out at least once + * having consumed at most 2/10 of + * its budget + */ + BFQ_BFQQ_FLAG_in_large_burst, /* + * bfqq activated in a large burst, + * see comments to bfq_handle_burst. + */ + BFQ_BFQQ_FLAG_constantly_seeky, /* + * bfqq has proved to be slow and + * seeky until budget timeout + */ + BFQ_BFQQ_FLAG_softrt_update, /* + * may need softrt-next-start + * update + */ + BFQ_BFQQ_FLAG_coop, /* bfqq is shared */ + BFQ_BFQQ_FLAG_split_coop, /* shared bfqq will be split */ + BFQ_BFQQ_FLAG_just_split, /* queue has just been split */ +}; + +#define BFQ_BFQQ_FNS(name) \ +static void bfq_mark_bfqq_##name(struct bfq_queue *bfqq) \ +{ \ + (bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name); \ +} \ +static void bfq_clear_bfqq_##name(struct bfq_queue *bfqq) \ +{ \ + (bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name); \ +} \ +static int bfq_bfqq_##name(const struct bfq_queue *bfqq) \ +{ \ + return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0; \ +} + +BFQ_BFQQ_FNS(busy); +BFQ_BFQQ_FNS(wait_request); +BFQ_BFQQ_FNS(must_alloc); +BFQ_BFQQ_FNS(fifo_expire); +BFQ_BFQQ_FNS(idle_window); +BFQ_BFQQ_FNS(sync); +BFQ_BFQQ_FNS(budget_new); +BFQ_BFQQ_FNS(IO_bound); +BFQ_BFQQ_FNS(in_large_burst); +BFQ_BFQQ_FNS(constantly_seeky); +BFQ_BFQQ_FNS(coop); +BFQ_BFQQ_FNS(split_coop); +BFQ_BFQQ_FNS(just_split); +BFQ_BFQQ_FNS(softrt_update); +#undef BFQ_BFQQ_FNS + +/* Logging facilities. */ +#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \ + blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args) + +#define bfq_log(bfqd, fmt, args...) \ + blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args) + +/* Expiration reasons. */ +enum bfqq_expiration { + BFQ_BFQQ_TOO_IDLE = 0, /* + * queue has been idling for + * too long + */ + BFQ_BFQQ_BUDGET_TIMEOUT, /* budget took too long to be used */ + BFQ_BFQQ_BUDGET_EXHAUSTED, /* budget consumed */ + BFQ_BFQQ_NO_MORE_REQUESTS, /* the queue has no more requests */ +}; + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + +struct bfqg_stats { + /* total bytes transferred */ + struct blkg_rwstat service_bytes; + /* total IOs serviced, post merge */ + struct blkg_rwstat serviced; + /* number of ios merged */ + struct blkg_rwstat merged; + /* total time spent on device in ns, may not be accurate w/ queueing */ + struct blkg_rwstat service_time; + /* total time spent waiting in scheduler queue in ns */ + struct blkg_rwstat wait_time; + /* number of IOs queued up */ + struct blkg_rwstat queued; + /* total sectors transferred */ + struct blkg_stat sectors; + /* total disk time and nr sectors dispatched by this group */ + struct blkg_stat time; + /* time not charged to this cgroup */ + struct blkg_stat unaccounted_time; + /* sum of number of ios queued across all samples */ + struct blkg_stat avg_queue_size_sum; + /* count of samples taken for average */ + struct blkg_stat avg_queue_size_samples; + /* how many times this group has been removed from service tree */ + struct blkg_stat dequeue; + /* total time spent waiting for it to be assigned a timeslice. */ + struct blkg_stat group_wait_time; + /* time spent idling for this blkcg_gq */ + struct blkg_stat idle_time; + /* total time with empty current active q with other requests queued */ + struct blkg_stat empty_time; + /* fields after this shouldn't be cleared on stat reset */ + uint64_t start_group_wait_time; + uint64_t start_idle_time; + uint64_t start_empty_time; + uint16_t flags; +}; + +/* + * struct bfq_group_data - per-blkcg storage for the blkio subsystem. + * + * @ps: @blkcg_policy_storage that this structure inherits + * @weight: weight of the bfq_group + */ +struct bfq_group_data { + /* must be the first member */ + struct blkcg_policy_data pd; + + unsigned short weight; +}; + +/** + * struct bfq_group - per (device, cgroup) data structure. + * @entity: schedulable entity to insert into the parent group sched_data. + * @sched_data: own sched_data, to contain child entities (they may be + * both bfq_queues and bfq_groups). + * @bfqd: the bfq_data for the device this group acts upon. + * @async_bfqq: array of async queues for all the tasks belonging to + * the group, one queue per ioprio value per ioprio_class, + * except for the idle class that has only one queue. + * @async_idle_bfqq: async queue for the idle class (ioprio is ignored). + * @my_entity: pointer to @entity, %NULL for the toplevel group; used + * to avoid too many special cases during group creation/ + * migration. + * @active_entities: number of active entities belonging to the group; + * unused for the root group. Used to know whether there + * are groups with more than one active @bfq_entity + * (see the comments to the function + * bfq_bfqq_must_not_expire()). + * @rq_pos_tree: rbtree sorted by next_request position, used when + * determining if two or more queues have interleaving + * requests (see bfq_find_close_cooperator()). + * + * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup + * there is a set of bfq_groups, each one collecting the lower-level + * entities belonging to the group that are acting on the same device. + * + * Locking works as follows: + * o @bfqd is protected by the queue lock, RCU is used to access it + * from the readers. + * o All the other fields are protected by the @bfqd queue lock. + */ +struct bfq_group { + /* must be the first member */ + struct blkg_policy_data pd; + + struct bfq_entity entity; + struct bfq_sched_data sched_data; + + void *bfqd; + + struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR]; + struct bfq_queue *async_idle_bfqq; + + struct bfq_entity *my_entity; + + int active_entities; + + struct rb_root rq_pos_tree; + + struct bfqg_stats stats; + struct bfqg_stats dead_stats; /* stats pushed from dead children */ +}; + +#else +struct bfq_group { + struct bfq_sched_data sched_data; + + struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR]; + struct bfq_queue *async_idle_bfqq; + + struct rb_root rq_pos_tree; +}; +#endif + +static struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity); + +static struct bfq_service_tree * +bfq_entity_service_tree(struct bfq_entity *entity) +{ + struct bfq_sched_data *sched_data = entity->sched_data; + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + unsigned int idx = bfqq ? bfqq->ioprio_class - 1 : + BFQ_DEFAULT_GRP_CLASS; + + BUG_ON(idx >= BFQ_IOPRIO_CLASSES); + BUG_ON(sched_data == NULL); + + return sched_data->service_tree + idx; +} + +static struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync) +{ + return bic->bfqq[is_sync]; +} + +static void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, + bool is_sync) +{ + bic->bfqq[is_sync] = bfqq; +} + +static struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic) +{ + return bic->icq.q->elevator->elevator_data; +} + +/** + * bfq_get_bfqd_locked - get a lock to a bfqd using a RCU protected pointer. + * @ptr: a pointer to a bfqd. + * @flags: storage for the flags to be saved. + * + * This function allows bfqg->bfqd to be protected by the + * queue lock of the bfqd they reference; the pointer is dereferenced + * under RCU, so the storage for bfqd is assured to be safe as long + * as the RCU read side critical section does not end. After the + * bfqd->queue->queue_lock is taken the pointer is rechecked, to be + * sure that no other writer accessed it. If we raced with a writer, + * the function returns NULL, with the queue unlocked, otherwise it + * returns the dereferenced pointer, with the queue locked. + */ +static struct bfq_data *bfq_get_bfqd_locked(void **ptr, unsigned long *flags) +{ + struct bfq_data *bfqd; + + rcu_read_lock(); + bfqd = rcu_dereference(*(struct bfq_data **)ptr); + + if (bfqd != NULL) { + spin_lock_irqsave(bfqd->queue->queue_lock, *flags); + if (ptr == NULL) + printk(KERN_CRIT "get_bfqd_locked pointer NULL\n"); + else if (*ptr == bfqd) + goto out; + spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags); + } + + bfqd = NULL; +out: + rcu_read_unlock(); + return bfqd; +} + +static void bfq_put_bfqd_unlock(struct bfq_data *bfqd, unsigned long *flags) +{ + spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags); +} + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + +static struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq) +{ + struct bfq_entity *group_entity = bfqq->entity.parent; + + if (!group_entity) + group_entity = &bfqq->bfqd->root_group->entity; + + return container_of(group_entity, struct bfq_group, entity); +} + +#else + +static struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq) +{ + return bfqq->bfqd->root_group; +} + +#endif + +static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio); +static void bfq_put_queue(struct bfq_queue *bfqq); +static void bfq_dispatch_insert(struct request_queue *q, struct request *rq); +static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, + struct bio *bio, int is_sync, + struct bfq_io_cq *bic, gfp_t gfp_mask); +static void bfq_end_wr_async_queues(struct bfq_data *bfqd, + struct bfq_group *bfqg); +static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg); +static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq); + +#endif /* _BFQ_H */ diff --git a/block/blk-core.c b/block/blk-core.c index b83d297..64d137b 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -48,6 +48,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug); DEFINE_IDA(blk_queue_ida); +int trap_non_toi_io; + /* * For the allocated request tables */ @@ -2104,6 +2106,9 @@ blk_qc_t submit_bio(int rw, struct bio *bio) { bio->bi_rw |= rw; + if (unlikely(trap_non_toi_io)) + BUG_ON(!bio_flagged(bio, BIO_TOI)); + /* * If it's a regular read/write or a barrier with data attached, * go through the normal accounting stuff before submission. diff --git a/block/genhd.c b/block/genhd.c index 9f42526..7a6a655 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -18,6 +18,8 @@ #include #include #include +#include +#include #include #include #include @@ -1417,6 +1419,85 @@ int invalidate_partition(struct gendisk *disk, int partno) EXPORT_SYMBOL(invalidate_partition); +dev_t blk_lookup_fs_info(struct fs_info *seek) +{ + dev_t devt = MKDEV(0, 0); + struct class_dev_iter iter; + struct device *dev; + int best_score = 0; + + class_dev_iter_init(&iter, &block_class, NULL, &disk_type); + while (best_score < 3 && (dev = class_dev_iter_next(&iter))) { + struct gendisk *disk = dev_to_disk(dev); + struct disk_part_iter piter; + struct hd_struct *part; + + disk_part_iter_init(&piter, disk, DISK_PITER_INCL_PART0); + + while (best_score < 3 && (part = disk_part_iter_next(&piter))) { + int score = part_matches_fs_info(part, seek); + if (score > best_score) { + devt = part_devt(part); + best_score = score; + } + } + disk_part_iter_exit(&piter); + } + class_dev_iter_exit(&iter); + return devt; +} + +/* Caller uses NULL, key to start. For each match found, we return a bdev on + * which we have done blkdev_get, and we do the blkdev_put on block devices + * that are passed to us. When no more matches are found, we return NULL. + */ +struct block_device *next_bdev_of_type(struct block_device *last, + const char *key) +{ + dev_t devt = MKDEV(0, 0); + struct class_dev_iter iter; + struct device *dev; + struct block_device *next = NULL, *bdev; + int got_last = 0; + + if (!key) + goto out; + + class_dev_iter_init(&iter, &block_class, NULL, &disk_type); + while (!devt && (dev = class_dev_iter_next(&iter))) { + struct gendisk *disk = dev_to_disk(dev); + struct disk_part_iter piter; + struct hd_struct *part; + + disk_part_iter_init(&piter, disk, DISK_PITER_INCL_PART0); + + while ((part = disk_part_iter_next(&piter))) { + bdev = bdget(part_devt(part)); + if (last && !got_last) { + if (last == bdev) + got_last = 1; + continue; + } + + if (blkdev_get(bdev, FMODE_READ, 0)) + continue; + + if (bdev_matches_key(bdev, key)) { + next = bdev; + break; + } + + blkdev_put(bdev, FMODE_READ); + } + disk_part_iter_exit(&piter); + } + class_dev_iter_exit(&iter); +out: + if (last) + blkdev_put(last, FMODE_READ); + return next; +} + /* * Disk events - monitor disk events like media change and eject request. */ diff --git b/block/uuid.c b/block/uuid.c new file mode 100644 index 0000000..4610d7b --- /dev/null +++ b/block/uuid.c @@ -0,0 +1,509 @@ +#include +#include +#include +#include +#include + +static int debug_enabled; + +#define PRINTK(fmt, args...) do { \ + if (debug_enabled) \ + printk(KERN_DEBUG fmt, ## args); \ + } while(0) + +#define PRINT_HEX_DUMP(v1, v2, v3, v4, v5, v6, v7, v8) \ + do { \ + if (debug_enabled) \ + print_hex_dump(v1, v2, v3, v4, v5, v6, v7, v8); \ + } while(0) + +/* + * Simple UUID translation + */ + +struct uuid_info { + const char *key; + const char *name; + long bkoff; + unsigned sboff; + unsigned sig_len; + const char *magic; + int uuid_offset; + int last_mount_offset; + int last_mount_size; +}; + +/* + * Based on libuuid's blkid_magic array. Note that I don't + * have uuid offsets for all of these yet - mssing ones are 0x0. + * Further information welcome. + * + * Rearranged by page of fs signature for optimisation. + */ +static struct uuid_info uuid_list[] = { + { NULL, "oracleasm", 0, 32, 8, "ORCLDISK", 0x0, 0, 0 }, + { "ntfs", "ntfs", 0, 3, 8, "NTFS ", 0x0, 0, 0 }, + { "vfat", "vfat", 0, 0x52, 5, "MSWIN", 0x0, 0, 0 }, + { "vfat", "vfat", 0, 0x52, 8, "FAT32 ", 0x0, 0, 0 }, + { "vfat", "vfat", 0, 0x36, 5, "MSDOS", 0x0, 0, 0 }, + { "vfat", "vfat", 0, 0x36, 8, "FAT16 ", 0x0, 0, 0 }, + { "vfat", "vfat", 0, 0x36, 8, "FAT12 ", 0x0, 0, 0 }, + { "vfat", "vfat", 0, 0, 1, "\353", 0x0, 0, 0 }, + { "vfat", "vfat", 0, 0, 1, "\351", 0x0, 0, 0 }, + { "vfat", "vfat", 0, 0x1fe, 2, "\125\252", 0x0, 0, 0 }, + { "xfs", "xfs", 0, 0, 4, "XFSB", 0x20, 0, 0 }, + { "romfs", "romfs", 0, 0, 8, "-rom1fs-", 0x0, 0, 0 }, + { "bfs", "bfs", 0, 0, 4, "\316\372\173\033", 0, 0, 0 }, + { "cramfs", "cramfs", 0, 0, 4, "E=\315\050", 0x0, 0, 0 }, + { "qnx4", "qnx4", 0, 4, 6, "QNX4FS", 0, 0, 0 }, + { NULL, "crypt_LUKS", 0, 0, 6, "LUKS\xba\xbe", 0x0, 0, 0 }, + { "squashfs", "squashfs", 0, 0, 4, "sqsh", 0, 0, 0 }, + { "squashfs", "squashfs", 0, 0, 4, "hsqs", 0, 0, 0 }, + { "ocfs", "ocfs", 0, 8, 9, "OracleCFS", 0x0, 0, 0 }, + { "lvm2pv", "lvm2pv", 0, 0x018, 8, "LVM2 001", 0x0, 0, 0 }, + { "sysv", "sysv", 0, 0x3f8, 4, "\020~\030\375", 0, 0, 0 }, + { "ext", "ext", 1, 0x38, 2, "\123\357", 0x468, 0x42c, 4 }, + { "minix", "minix", 1, 0x10, 2, "\177\023", 0, 0, 0 }, + { "minix", "minix", 1, 0x10, 2, "\217\023", 0, 0, 0 }, + { "minix", "minix", 1, 0x10, 2, "\150\044", 0, 0, 0 }, + { "minix", "minix", 1, 0x10, 2, "\170\044", 0, 0, 0 }, + { "lvm2pv", "lvm2pv", 1, 0x018, 8, "LVM2 001", 0x0, 0, 0 }, + { "vxfs", "vxfs", 1, 0, 4, "\365\374\001\245", 0, 0, 0 }, + { "hfsplus", "hfsplus", 1, 0, 2, "BD", 0x0, 0, 0 }, + { "hfsplus", "hfsplus", 1, 0, 2, "H+", 0x0, 0, 0 }, + { "hfsplus", "hfsplus", 1, 0, 2, "HX", 0x0, 0, 0 }, + { "hfs", "hfs", 1, 0, 2, "BD", 0x0, 0, 0 }, + { "ocfs2", "ocfs2", 1, 0, 6, "OCFSV2", 0x0, 0, 0 }, + { "lvm2pv", "lvm2pv", 0, 0x218, 8, "LVM2 001", 0x0, 0, 0 }, + { "lvm2pv", "lvm2pv", 1, 0x218, 8, "LVM2 001", 0x0, 0, 0 }, + { "ocfs2", "ocfs2", 2, 0, 6, "OCFSV2", 0x0, 0, 0 }, + { "swap", "swap", 0, 0xff6, 10, "SWAP-SPACE", 0x40c, 0, 0 }, + { "swap", "swap", 0, 0xff6, 10, "SWAPSPACE2", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0xff6, 9, "S1SUSPEND", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0xff6, 9, "S2SUSPEND", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0xff6, 9, "ULSUSPEND", 0x40c, 0, 0 }, + { "ocfs2", "ocfs2", 4, 0, 6, "OCFSV2", 0x0, 0, 0 }, + { "ocfs2", "ocfs2", 8, 0, 6, "OCFSV2", 0x0, 0, 0 }, + { "hpfs", "hpfs", 8, 0, 4, "I\350\225\371", 0, 0, 0 }, + { "reiserfs", "reiserfs", 8, 0x34, 8, "ReIsErFs", 0x10054, 0, 0 }, + { "reiserfs", "reiserfs", 8, 20, 8, "ReIsErFs", 0x10054, 0, 0 }, + { "zfs", "zfs", 8, 0, 8, "\0\0\x02\xf5\xb0\x07\xb1\x0c", 0x0, 0, 0 }, + { "zfs", "zfs", 8, 0, 8, "\x0c\xb1\x07\xb0\xf5\x02\0\0", 0x0, 0, 0 }, + { "ufs", "ufs", 8, 0x55c, 4, "T\031\001\000", 0, 0, 0 }, + { "swap", "swap", 0, 0x1ff6, 10, "SWAP-SPACE", 0x40c, 0, 0 }, + { "swap", "swap", 0, 0x1ff6, 10, "SWAPSPACE2", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0x1ff6, 9, "S1SUSPEND", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0x1ff6, 9, "S2SUSPEND", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0x1ff6, 9, "ULSUSPEND", 0x40c, 0, 0 }, + { "reiserfs", "reiserfs", 64, 0x34, 9, "ReIsEr2Fs", 0x10054, 0, 0 }, + { "reiserfs", "reiserfs", 64, 0x34, 9, "ReIsEr3Fs", 0x10054, 0, 0 }, + { "reiserfs", "reiserfs", 64, 0x34, 8, "ReIsErFs", 0x10054, 0, 0 }, + { "reiser4", "reiser4", 64, 0, 7, "ReIsEr4", 0x100544, 0, 0 }, + { "gfs2", "gfs2", 64, 0, 4, "\x01\x16\x19\x70", 0x0, 0, 0 }, + { "gfs", "gfs", 64, 0, 4, "\x01\x16\x19\x70", 0x0, 0, 0 }, + { "btrfs", "btrfs", 64, 0x40, 8, "_BHRfS_M", 0x0, 0, 0 }, + { "swap", "swap", 0, 0x3ff6, 10, "SWAP-SPACE", 0x40c, 0, 0 }, + { "swap", "swap", 0, 0x3ff6, 10, "SWAPSPACE2", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0x3ff6, 9, "S1SUSPEND", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0x3ff6, 9, "S2SUSPEND", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0x3ff6, 9, "ULSUSPEND", 0x40c, 0, 0 }, + { "udf", "udf", 32, 1, 5, "BEA01", 0x0, 0, 0 }, + { "udf", "udf", 32, 1, 5, "BOOT2", 0x0, 0, 0 }, + { "udf", "udf", 32, 1, 5, "CD001", 0x0, 0, 0 }, + { "udf", "udf", 32, 1, 5, "CDW02", 0x0, 0, 0 }, + { "udf", "udf", 32, 1, 5, "NSR02", 0x0, 0, 0 }, + { "udf", "udf", 32, 1, 5, "NSR03", 0x0, 0, 0 }, + { "udf", "udf", 32, 1, 5, "TEA01", 0x0, 0, 0 }, + { "iso9660", "iso9660", 32, 1, 5, "CD001", 0x0, 0, 0 }, + { "iso9660", "iso9660", 32, 9, 5, "CDROM", 0x0, 0, 0 }, + { "jfs", "jfs", 32, 0, 4, "JFS1", 0x88, 0, 0 }, + { "swap", "swap", 0, 0x7ff6, 10, "SWAP-SPACE", 0x40c, 0, 0 }, + { "swap", "swap", 0, 0x7ff6, 10, "SWAPSPACE2", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0x7ff6, 9, "S1SUSPEND", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0x7ff6, 9, "S2SUSPEND", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0x7ff6, 9, "ULSUSPEND", 0x40c, 0, 0 }, + { "swap", "swap", 0, 0xfff6, 10, "SWAP-SPACE", 0x40c, 0, 0 }, + { "swap", "swap", 0, 0xfff6, 10, "SWAPSPACE2", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0xfff6, 9, "S1SUSPEND", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0xfff6, 9, "S2SUSPEND", 0x40c, 0, 0 }, + { "swap", "swsuspend", 0, 0xfff6, 9, "ULSUSPEND", 0x40c, 0, 0 }, + { "zfs", "zfs", 264, 0, 8, "\0\0\x02\xf5\xb0\x07\xb1\x0c", 0x0, 0, 0 }, + { "zfs", "zfs", 264, 0, 8, "\x0c\xb1\x07\xb0\xf5\x02\0\0", 0x0, 0, 0 }, + { NULL, NULL, 0, 0, 0, NULL, 0x0, 0, 0 } +}; + +static int null_uuid(const char *uuid) +{ + int i; + + for (i = 0; i < 16 && !uuid[i]; i++); + + return (i == 16); +} + + +static void uuid_end_bio(struct bio *bio) +{ + struct page *page = bio->bi_io_vec[0].bv_page; + + if (bio->bi_error) + SetPageError(page); + + unlock_page(page); + bio_put(bio); +} + + +/** + * submit - submit BIO request + * @dev: The block device we're using. + * @page_num: The page we're reading. + * + * Based on Patrick Mochell's pmdisk code from long ago: "Straight from the + * textbook - allocate and initialize the bio. If we're writing, make sure + * the page is marked as dirty. Then submit it and carry on." + **/ +static struct page *read_bdev_page(struct block_device *dev, int page_num) +{ + struct bio *bio = NULL; + struct page *page = alloc_page(GFP_NOFS | __GFP_HIGHMEM); + + if (!page) { + printk(KERN_ERR "Failed to allocate a page for reading data " + "in UUID checks."); + return NULL; + } + + bio = bio_alloc(GFP_NOFS, 1); + bio->bi_bdev = dev; + bio->bi_iter.bi_sector = page_num << 3; + bio->bi_end_io = uuid_end_bio; + bio->bi_flags |= (1 << BIO_TOI); + + PRINTK("Submitting bio on device %lx, page %d using bio %p and page %p.\n", + (unsigned long) dev->bd_dev, page_num, bio, page); + + if (bio_add_page(bio, page, PAGE_SIZE, 0) < PAGE_SIZE) { + printk(KERN_DEBUG "ERROR: adding page to bio at %d\n", + page_num); + bio_put(bio); + __free_page(page); + printk(KERN_DEBUG "read_bdev_page freed page %p (in error " + "path).\n", page); + return NULL; + } + + lock_page(page); + submit_bio(READ | REQ_SYNC, bio); + + wait_on_page_locked(page); + if (PageError(page)) { + __free_page(page); + page = NULL; + } + return page; +} + +int bdev_matches_key(struct block_device *bdev, const char *key) +{ + unsigned char *data = NULL; + struct page *data_page = NULL; + + int dev_offset, pg_num, pg_off, i; + int last_pg_num = -1; + int result = 0; + char buf[50]; + + if (null_uuid(key)) { + PRINTK("Refusing to find a NULL key.\n"); + return 0; + } + + if (!bdev->bd_disk) { + bdevname(bdev, buf); + PRINTK("bdev %s has no bd_disk.\n", buf); + return 0; + } + + if (!bdev->bd_disk->queue) { + bdevname(bdev, buf); + PRINTK("bdev %s has no queue.\n", buf); + return 0; + } + + for (i = 0; uuid_list[i].name; i++) { + struct uuid_info *dat = &uuid_list[i]; + + if (!dat->key || strcmp(dat->key, key)) + continue; + + dev_offset = (dat->bkoff << 10) + dat->sboff; + pg_num = dev_offset >> 12; + pg_off = dev_offset & 0xfff; + + if ((((pg_num + 1) << 3) - 1) > bdev->bd_part->nr_sects >> 1) + continue; + + if (pg_num != last_pg_num) { + if (data_page) { + kunmap(data_page); + __free_page(data_page); + } + data_page = read_bdev_page(bdev, pg_num); + if (!data_page) + continue; + data = kmap(data_page); + } + + last_pg_num = pg_num; + + if (strncmp(&data[pg_off], dat->magic, dat->sig_len)) + continue; + + result = 1; + break; + } + + if (data_page) { + kunmap(data_page); + __free_page(data_page); + } + + return result; +} + +/* + * part_matches_fs_info - Does the given partition match the details given? + * + * Returns a score saying how good the match is. + * 0 = no UUID match. + * 1 = UUID but last mount time differs. + * 2 = UUID, last mount time but not dev_t + * 3 = perfect match + * + * This lets us cope elegantly with probing resulting in dev_ts changing + * from boot to boot, and with the case where a user copies a partition + * (UUID is non unique), and we need to check the last mount time of the + * correct partition. + */ +int part_matches_fs_info(struct hd_struct *part, struct fs_info *seek) +{ + struct block_device *bdev; + struct fs_info *got; + int result = 0; + char buf[50]; + + if (null_uuid((char *) &seek->uuid)) { + PRINTK("Refusing to find a NULL uuid.\n"); + return 0; + } + + bdev = bdget(part_devt(part)); + + PRINTK("part_matches fs info considering %x.\n", part_devt(part)); + + if (blkdev_get(bdev, FMODE_READ, 0)) { + PRINTK("blkdev_get failed.\n"); + return 0; + } + + if (!bdev->bd_disk) { + bdevname(bdev, buf); + PRINTK("bdev %s has no bd_disk.\n", buf); + goto out; + } + + if (!bdev->bd_disk->queue) { + bdevname(bdev, buf); + PRINTK("bdev %s has no queue.\n", buf); + goto out; + } + + got = fs_info_from_block_dev(bdev); + + if (got && !memcmp(got->uuid, seek->uuid, 16)) { + PRINTK(" Have matching UUID.\n"); + PRINTK(" Got: LMS %d, LM %p.\n", got->last_mount_size, got->last_mount); + PRINTK(" Seek: LMS %d, LM %p.\n", seek->last_mount_size, seek->last_mount); + result = 1; + + if (got->last_mount_size == seek->last_mount_size && + got->last_mount && seek->last_mount && + !memcmp(got->last_mount, seek->last_mount, + got->last_mount_size)) { + result = 2; + + PRINTK(" Matching last mount time.\n"); + + if (part_devt(part) == seek->dev_t) { + result = 3; + PRINTK(" Matching dev_t.\n"); + } else + PRINTK("Dev_ts differ (%x vs %x).\n", part_devt(part), seek->dev_t); + } + } + + PRINTK(" Score for %x is %d.\n", part_devt(part), result); + free_fs_info(got); +out: + blkdev_put(bdev, FMODE_READ); + return result; +} + +void free_fs_info(struct fs_info *fs_info) +{ + if (!fs_info || IS_ERR(fs_info)) + return; + + if (fs_info->last_mount) + kfree(fs_info->last_mount); + + kfree(fs_info); +} + +struct fs_info *fs_info_from_block_dev(struct block_device *bdev) +{ + unsigned char *data = NULL; + struct page *data_page = NULL; + + int dev_offset, pg_num, pg_off; + int uuid_pg_num, uuid_pg_off, i; + unsigned char *uuid_data = NULL; + struct page *uuid_data_page = NULL; + + int last_pg_num = -1, last_uuid_pg_num = 0; + char buf[50]; + struct fs_info *fs_info = NULL; + + bdevname(bdev, buf); + + PRINTK("uuid_from_block_dev looking for partition type of %s.\n", buf); + + for (i = 0; uuid_list[i].name; i++) { + struct uuid_info *dat = &uuid_list[i]; + dev_offset = (dat->bkoff << 10) + dat->sboff; + pg_num = dev_offset >> 12; + pg_off = dev_offset & 0xfff; + uuid_pg_num = dat->uuid_offset >> 12; + uuid_pg_off = dat->uuid_offset & 0xfff; + + if ((((pg_num + 1) << 3) - 1) > bdev->bd_part->nr_sects >> 1) + continue; + + /* Ignore partition types with no UUID offset */ + if (!dat->uuid_offset) + continue; + + if (pg_num != last_pg_num) { + if (data_page) { + kunmap(data_page); + __free_page(data_page); + } + data_page = read_bdev_page(bdev, pg_num); + if (!data_page) + continue; + data = kmap(data_page); + } + + last_pg_num = pg_num; + + if (strncmp(&data[pg_off], dat->magic, dat->sig_len)) + continue; + + PRINTK("This partition looks like %s.\n", dat->name); + + fs_info = kzalloc(sizeof(struct fs_info), GFP_KERNEL); + + if (!fs_info) { + PRINTK("Failed to allocate fs_info struct."); + fs_info = ERR_PTR(-ENOMEM); + break; + } + + /* UUID can't be off the end of the disk */ + if ((uuid_pg_num > bdev->bd_part->nr_sects >> 3) || + !dat->uuid_offset) + goto no_uuid; + + if (!uuid_data || uuid_pg_num != last_uuid_pg_num) { + /* No need to reread the page from above */ + if (uuid_pg_num == pg_num && uuid_data) + memcpy(uuid_data, data, PAGE_SIZE); + else { + if (uuid_data_page) { + kunmap(uuid_data_page); + __free_page(uuid_data_page); + } + uuid_data_page = read_bdev_page(bdev, uuid_pg_num); + if (!uuid_data_page) + continue; + uuid_data = kmap(uuid_data_page); + } + } + + last_uuid_pg_num = uuid_pg_num; + memcpy(&fs_info->uuid, &uuid_data[uuid_pg_off], 16); + fs_info->dev_t = bdev->bd_dev; + +no_uuid: + PRINT_HEX_DUMP(KERN_EMERG, "fs_info_from_block_dev " + "returning uuid ", DUMP_PREFIX_NONE, 16, 1, + fs_info->uuid, 16, 0); + + if (dat->last_mount_size) { + int pg = dat->last_mount_offset >> 12, sz; + int off = dat->last_mount_offset & 0xfff; + struct page *last_mount = read_bdev_page(bdev, pg); + unsigned char *last_mount_data; + char *ptr; + + if (!last_mount) { + fs_info = ERR_PTR(-ENOMEM); + break; + } + last_mount_data = kmap(last_mount); + sz = dat->last_mount_size; + ptr = kmalloc(sz, GFP_KERNEL); + + if (!ptr) { + printk(KERN_EMERG "fs_info_from_block_dev " + "failed to get memory for last mount " + "timestamp."); + free_fs_info(fs_info); + fs_info = ERR_PTR(-ENOMEM); + } else { + fs_info->last_mount = ptr; + fs_info->last_mount_size = sz; + memcpy(ptr, &last_mount_data[off], sz); + } + + kunmap(last_mount); + __free_page(last_mount); + } + break; + } + + if (data_page) { + kunmap(data_page); + __free_page(data_page); + } + + if (uuid_data_page) { + kunmap(uuid_data_page); + __free_page(uuid_data_page); + } + + return fs_info; +} + +static int __init uuid_debug_setup(char *str) +{ + int value; + + if (sscanf(str, "=%d", &value)) + debug_enabled = value; + + return 1; +} + +__setup("uuid_debug", uuid_debug_setup); diff --git a/drivers/accessibility/braille/braille_console.c b/drivers/accessibility/braille/braille_console.c index dc34a5b..020118b 100644 --- a/drivers/accessibility/braille/braille_console.c +++ b/drivers/accessibility/braille/braille_console.c @@ -116,7 +116,7 @@ static void braille_write(u16 *buf) *c++ = csum; *c++ = ETX; - braille_co->write(braille_co, data, c - data); + braille_co->write(braille_co, data, c - data, 0); } /* Follow the VC cursor*/ diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index 55e257c..474f9f4 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -2096,7 +2096,11 @@ static int ata_dev_config_ncq(struct ata_device *dev, return 0; } if (ap->flags & ATA_FLAG_NCQ) { +#ifdef CONFIG_PCK_INTERACTIVE + hdepth = min(ap->scsi_host->can_queue, 8); +#else hdepth = min(ap->scsi_host->can_queue, ATA_MAX_QUEUE - 1); +#endif dev->flags |= ATA_DFLAG_NCQ; } diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c index eae5107..9e6fe9e 100644 --- a/drivers/cpufreq/cpufreq_ondemand.c +++ b/drivers/cpufreq/cpufreq_ondemand.c @@ -20,7 +20,11 @@ /* On-demand governor macros */ #define DEF_FREQUENCY_UP_THRESHOLD (80) +#ifdef CONFIG_PCK_INTERACTIVE +#define DEF_SAMPLING_DOWN_FACTOR (10) +#else #define DEF_SAMPLING_DOWN_FACTOR (1) +#endif #define MAX_SAMPLING_DOWN_FACTOR (100000) #define MICRO_FREQUENCY_UP_THRESHOLD (95) #define MICRO_FREQUENCY_MIN_SAMPLE_RATE (10000) diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c index 2e8c77e..8d75300 100644 --- a/drivers/gpu/drm/drm_gem.c +++ b/drivers/gpu/drm/drm_gem.c @@ -137,7 +137,7 @@ int drm_gem_object_init(struct drm_device *dev, drm_gem_private_object_init(dev, obj, size); - filp = shmem_file_setup("drm mm object", size, VM_NORESERVE); + filp = shmem_file_setup("drm mm object", size, VM_NORESERVE, 1); if (IS_ERR(filp)) return PTR_ERR(filp); diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index f357058..08854ae 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -761,12 +761,12 @@ static int i915_drm_resume(struct drm_device *dev) dev_priv->display.hpd_irq_setup(dev); spin_unlock_irq(&dev_priv->irq_lock); + intel_dp_mst_resume(dev); + drm_modeset_lock_all(dev); intel_display_resume(dev); drm_modeset_unlock_all(dev); - intel_dp_mst_resume(dev); - /* * ... but also need to make sure that hotplug processing * doesn't cause havoc. Like in the driver load code we don't diff --git a/drivers/gpu/drm/i915/intel_dp.c b/drivers/gpu/drm/i915/intel_dp.c index cdc2c15..4f61712 100644 --- a/drivers/gpu/drm/i915/intel_dp.c +++ b/drivers/gpu/drm/i915/intel_dp.c @@ -6113,6 +6113,19 @@ void intel_dp_mst_resume(struct drm_device *dev) ret = drm_dp_mst_topology_mgr_resume(&intel_dig_port->dp.mst_mgr); if (ret != 0) { + /* + * For some reason, some laptops can't bring + * their MST docks back up immediately after + * resume and need to wait a short period of + * time before aux transactions with the dock + * become functional again. Until we find a + * proper fix for this, this workaround should + * suffice + */ + msleep(30); + ret = drm_dp_mst_topology_mgr_resume(&intel_dig_port->dp.mst_mgr); + } + if (ret != 0) { intel_dp_check_mst_status(&intel_dig_port->dp); } } diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c index ca4c01f..2062228 100644 --- a/drivers/gpu/drm/radeon/r600.c +++ b/drivers/gpu/drm/radeon/r600.c @@ -2489,7 +2489,7 @@ int r600_init_microcode(struct radeon_device *rdev) } DRM_INFO("Loading %s Microcode\n", chip_name); - +#if 0 snprintf(fw_name, sizeof(fw_name), "/*(DEBLOBBED)*/", chip_name); err = reject_firmware(&rdev->pfp_fw, fw_name, rdev->dev); if (err) @@ -2541,7 +2541,7 @@ int r600_init_microcode(struct radeon_device *rdev) err = -EINVAL; } } - +#endif out: if (err) { if (err != -EINVAL) @@ -3201,7 +3201,7 @@ int r600_init(struct radeon_device *rdev) r = radeon_bo_init(rdev); if (r) return r; - +#if 0 if (!rdev->me_fw || !rdev->pfp_fw || !rdev->rlc_fw) { r = r600_init_microcode(rdev); if (r) { @@ -3209,7 +3209,7 @@ int r600_init(struct radeon_device *rdev) return r; } } - +#endif /* Initialize power management */ radeon_pm_init(rdev); diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c index 4e19d0f..aaaf683 100644 --- a/drivers/gpu/drm/ttm/ttm_tt.c +++ b/drivers/gpu/drm/ttm/ttm_tt.c @@ -339,7 +339,7 @@ int ttm_tt_swapout(struct ttm_tt *ttm, struct file *persistent_swap_storage) if (!persistent_swap_storage) { swap_storage = shmem_file_setup("ttm swap", ttm->num_pages << PAGE_SHIFT, - 0); + 0, 0); if (IS_ERR(swap_storage)) { pr_err("Failed allocating swap storage\n"); return PTR_ERR(swap_storage); diff --git a/drivers/input/joydev.c b/drivers/input/joydev.c index 5d11fea..e819a60 100644 --- a/drivers/input/joydev.c +++ b/drivers/input/joydev.c @@ -28,15 +28,21 @@ #include #include #include +#include +#include MODULE_AUTHOR("Vojtech Pavlik "); MODULE_DESCRIPTION("Joystick device interfaces"); MODULE_SUPPORTED_DEVICE("input/js"); MODULE_LICENSE("GPL"); - #define JOYDEV_MINOR_BASE 0 #define JOYDEV_MINORS 16 #define JOYDEV_BUFFER_SIZE 64 +#define MAX_REMAP_SIZE 10 + +static int remap_array[MAX_REMAP_SIZE]; +static int remap_count = 0; +static int free_buttons[MAX_REMAP_SIZE]; struct joydev { int open; @@ -71,6 +77,9 @@ struct joydev_client { struct list_head node; }; +module_param_array(remap_array, int, &remap_count, 0 ); +MODULE_PARM_DESC( remap_array, "remap axis to buttons\n" ); + static int joydev_correct(int value, struct js_corr *corr) { switch (corr->type) { @@ -121,6 +130,17 @@ static void joydev_event(struct input_handle *handle, struct joydev *joydev = handle->private; struct joydev_client *client; struct js_event event; + int i; + + if( remap_count > 0 && remap_count < MAX_REMAP_SIZE ){ + for( i = 0; i < remap_count; i++ ) + if( code == remap_array[i] ){ + type = EV_KEY; + code = free_buttons[i]; + if( value == 255 ) + value = 1; + } + } switch (type) { @@ -816,7 +836,7 @@ static int joydev_connect(struct input_handler *handler, struct input_dev *dev, const struct input_device_id *id) { struct joydev *joydev; - int i, j, t, minor, dev_no; + int i, j = 0, t, minor, dev_no; int error; minor = input_get_new_minor(JOYDEV_MINOR_BASE, JOYDEV_MINORS, true); @@ -860,15 +880,24 @@ static int joydev_connect(struct input_handler *handler, struct input_dev *dev, joydev->keymap[i] = joydev->nkey; joydev->keypam[joydev->nkey] = i + BTN_MISC; joydev->nkey++; - } + j = i; + } + if( remap_count > 0 && remap_count < MAX_REMAP_SIZE ){ + printk( "[joydev] axis remapping enabled\n" ); + for( i = 0; i < remap_count; i++ ){ + joydev->keymap[j + i + 1] = joydev->nkey; + joydev->keypam[joydev->nkey] = i + j + 1 + BTN_MISC; + free_buttons[i] = j + i + 1 + BTN_MISC; + joydev->nkey++; + } for (i = 0; i < BTN_JOYSTICK - BTN_MISC; i++) if (test_bit(i + BTN_MISC, dev->keybit)) { joydev->keymap[i] = joydev->nkey; joydev->keypam[joydev->nkey] = i + BTN_MISC; joydev->nkey++; } - + } for (i = 0; i < joydev->nabs; i++) { j = joydev->abspam[i]; if (input_abs_get_max(dev, j) == input_abs_get_min(dev, j)) { diff --git a/drivers/input/mouse/synaptics.c b/drivers/input/mouse/synaptics.c index 6025eb4..e678212 100644 --- a/drivers/input/mouse/synaptics.c +++ b/drivers/input/mouse/synaptics.c @@ -1250,7 +1250,9 @@ static void set_input_params(struct psmouse *psmouse, /* Clickpads report only left button */ __clear_bit(BTN_RIGHT, dev->keybit); __clear_bit(BTN_MIDDLE, dev->keybit); - } + } else if (SYN_CAP_CLICKPAD2BTN(priv->ext_cap_0c) || + SYN_CAP_CLICKPAD2BTN2(priv->ext_cap_0c)) + __set_bit(INPUT_PROP_BUTTONPAD, dev->propbit); } static ssize_t synaptics_show_disable_gesture(struct psmouse *psmouse, diff --git a/drivers/input/mouse/synaptics.h b/drivers/input/mouse/synaptics.h index 56faa7e..c20f32c 100644 --- a/drivers/input/mouse/synaptics.h +++ b/drivers/input/mouse/synaptics.h @@ -85,6 +85,7 @@ */ #define SYN_CAP_CLICKPAD(ex0c) ((ex0c) & 0x100000) /* 1-button ClickPad */ #define SYN_CAP_CLICKPAD2BTN(ex0c) ((ex0c) & 0x000100) /* 2-button ClickPad */ +#define SYN_CAP_CLICKPAD2BTN2(ex0c) ((ex0c) & 0x200000) /* 2-button ClickPad */ #define SYN_CAP_MAX_DIMENSIONS(ex0c) ((ex0c) & 0x020000) #define SYN_CAP_MIN_DIMENSIONS(ex0c) ((ex0c) & 0x002000) #define SYN_CAP_ADV_GESTURE(ex0c) ((ex0c) & 0x080000) diff --git a/drivers/input/touchscreen/atmel_mxt_ts.c b/drivers/input/touchscreen/atmel_mxt_ts.c index ce2fc48..e0c714f 100644 --- a/drivers/input/touchscreen/atmel_mxt_ts.c +++ b/drivers/input/touchscreen/atmel_mxt_ts.c @@ -1973,7 +1973,7 @@ static int mxt_initialize(struct mxt_data *data) if (error) goto err_free_object_table; - error = reject_firmware_nowait(THIS_MODULE, true, MXT_CFG_NAME, + error = request_firmware_nowait(THIS_MODULE, true, MXT_CFG_NAME, &client->dev, GFP_KERNEL, data, mxt_config_cb); if (error) { diff --git a/drivers/macintosh/Kconfig b/drivers/macintosh/Kconfig index 3e8b29e..c043885 100644 --- a/drivers/macintosh/Kconfig +++ b/drivers/macintosh/Kconfig @@ -171,6 +171,13 @@ config INPUT_ADBHID If unsure, say Y. +config ADB_TRACKPAD_ABSOLUTE + bool "Enable absolute mode for adb trackpads" + depends on INPUT_ADBHID + help + Enable absolute mode in adb-base trackpads. This feature adds + compatibility with synaptics Xorg / Xfree drivers. + config MAC_EMUMOUSEBTN tristate "Support for mouse button 2+3 emulation" depends on SYSCTL && INPUT diff --git a/drivers/macintosh/adbhid.c b/drivers/macintosh/adbhid.c index 09d72bb..8d23b27 100644 --- a/drivers/macintosh/adbhid.c +++ b/drivers/macintosh/adbhid.c @@ -261,6 +261,15 @@ static struct adb_ids buttons_ids; #define ADBMOUSE_MS_A3 8 /* Mouse systems A3 trackball (handler 3) */ #define ADBMOUSE_MACALLY2 9 /* MacAlly 2-button mouse */ +#ifdef CONFIG_ADB_TRACKPAD_ABSOLUTE +#define ABS_XMIN 310 +#define ABS_XMAX 1700 +#define ABS_YMIN 200 +#define ABS_YMAX 1000 +#define ABS_ZMIN 0 +#define ABS_ZMAX 55 +#endif + static void adbhid_keyboard_input(unsigned char *data, int nb, int apoll) { @@ -405,6 +414,9 @@ static void adbhid_mouse_input(unsigned char *data, int nb, int autopoll) { int id = (data[0] >> 4) & 0x0f; +#ifdef CONFIG_ADB_TRACKPAD_ABSOLUTE + int btn = 0; int x_axis = 0; int y_axis = 0; int z_axis = 0; +#endif if (!adbhid[id]) { printk(KERN_ERR "ADB HID on ID %d not yet registered\n", id); @@ -436,6 +448,17 @@ adbhid_mouse_input(unsigned char *data, int nb, int autopoll) high bits of y-axis motion. XY is additional high bits of x-axis motion. + For ADB Absolute motion protocol the data array will contain the + following values: + + BITS COMMENTS + data[0] = dddd 1100 ADB command: Talk, register 0, for device dddd. + data[1] = byyy yyyy Left button and y-axis motion. + data[2] = bxxx xxxx Second button and x-axis motion. + data[3] = 1yyy 1xxx Half bits of y-axis and x-axis motion. + data[4] = 1yyy 1xxx Higher bits of y-axis and x-axis motion. + data[5] = 1zzz 1zzz Higher and lower bits of z-pressure. + MacAlly 2-button mouse protocol. For MacAlly 2-button mouse protocol the data array will contain the @@ -458,8 +481,17 @@ adbhid_mouse_input(unsigned char *data, int nb, int autopoll) switch (adbhid[id]->mouse_kind) { case ADBMOUSE_TRACKPAD: +#ifdef CONFIG_ADB_TRACKPAD_ABSOLUTE + x_axis = (data[2] & 0x7f) | ((data[3] & 0x07) << 7) | + ((data[4] & 0x07) << 10); + y_axis = (data[1] & 0x7f) | ((data[3] & 0x70) << 3) | + ((data[4] & 0x70) << 6); + z_axis = (data[5] & 0x07) | ((data[5] & 0x70) >> 1); + btn = (!(data[1] >> 7)) & 1; +#else data[1] = (data[1] & 0x7f) | ((data[1] & data[2]) & 0x80); data[2] = data[2] | 0x80; +#endif break; case ADBMOUSE_MICROSPEED: data[1] = (data[1] & 0x7f) | ((data[3] & 0x01) << 7); @@ -485,17 +517,39 @@ adbhid_mouse_input(unsigned char *data, int nb, int autopoll) break; } - input_report_key(adbhid[id]->input, BTN_LEFT, !((data[1] >> 7) & 1)); - input_report_key(adbhid[id]->input, BTN_MIDDLE, !((data[2] >> 7) & 1)); +#ifdef CONFIG_ADB_TRACKPAD_ABSOLUTE + if ( adbhid[id]->mouse_kind == ADBMOUSE_TRACKPAD ) { - if (nb >= 4 && adbhid[id]->mouse_kind != ADBMOUSE_TRACKPAD) - input_report_key(adbhid[id]->input, BTN_RIGHT, !((data[3] >> 7) & 1)); + if(z_axis > 30) input_report_key(adbhid[id]->input, BTN_TOUCH, 1); + if(z_axis < 25) input_report_key(adbhid[id]->input, BTN_TOUCH, 0); - input_report_rel(adbhid[id]->input, REL_X, - ((data[2]&0x7f) < 64 ? (data[2]&0x7f) : (data[2]&0x7f)-128 )); - input_report_rel(adbhid[id]->input, REL_Y, - ((data[1]&0x7f) < 64 ? (data[1]&0x7f) : (data[1]&0x7f)-128 )); + if(z_axis > 0){ + input_report_abs(adbhid[id]->input, ABS_X, x_axis); + input_report_abs(adbhid[id]->input, ABS_Y, y_axis); + input_report_key(adbhid[id]->input, BTN_TOOL_FINGER, 1); + input_report_key(adbhid[id]->input, ABS_TOOL_WIDTH, 5); + } else { + input_report_key(adbhid[id]->input, BTN_TOOL_FINGER, 0); + input_report_key(adbhid[id]->input, ABS_TOOL_WIDTH, 0); + } + + input_report_abs(adbhid[id]->input, ABS_PRESSURE, z_axis); + input_report_key(adbhid[id]->input, BTN_LEFT, btn); + } else { +#endif + input_report_key(adbhid[id]->input, BTN_LEFT, !((data[1] >> 7) & 1)); + input_report_key(adbhid[id]->input, BTN_MIDDLE, !((data[2] >> 7) & 1)); + + if (nb >= 4 && adbhid[id]->mouse_kind != ADBMOUSE_TRACKPAD) + input_report_key(adbhid[id]->input, BTN_RIGHT, !((data[3] >> 7) & 1)); + input_report_rel(adbhid[id]->input, REL_X, + ((data[2]&0x7f) < 64 ? (data[2]&0x7f) : (data[2]&0x7f)-128 )); + input_report_rel(adbhid[id]->input, REL_Y, + ((data[1]&0x7f) < 64 ? (data[1]&0x7f) : (data[1]&0x7f)-128 )); +#ifdef CONFIG_ADB_TRACKPAD_ABSOLUTE + } +#endif input_sync(adbhid[id]->input); } @@ -849,6 +903,15 @@ adbhid_input_register(int id, int default_id, int original_handler_id, input_dev->keybit[BIT_WORD(BTN_MOUSE)] = BIT_MASK(BTN_LEFT) | BIT_MASK(BTN_MIDDLE) | BIT_MASK(BTN_RIGHT); input_dev->relbit[0] = BIT_MASK(REL_X) | BIT_MASK(REL_Y); +#ifdef CONFIG_ADB_TRACKPAD_ABSOLUTE + set_bit(EV_ABS, input_dev->evbit); + input_set_abs_params(input_dev, ABS_X, ABS_XMIN, ABS_XMAX, 0, 0); + input_set_abs_params(input_dev, ABS_Y, ABS_YMIN, ABS_YMAX, 0, 0); + input_set_abs_params(input_dev, ABS_PRESSURE, ABS_ZMIN, ABS_ZMAX, 0, 0); + set_bit(BTN_TOUCH, input_dev->keybit); + set_bit(BTN_TOOL_FINGER, input_dev->keybit); + set_bit(ABS_TOOL_WIDTH, input_dev->absbit); +#endif break; case ADB_MISC: @@ -1132,7 +1195,11 @@ init_trackpad(int id) r1_buffer[3], r1_buffer[4], r1_buffer[5], +#ifdef CONFIG_ADB_TRACKPAD_ABSOLUTE + 0x00, /* Enable absolute mode */ +#else 0x03, /*r1_buffer[6],*/ +#endif r1_buffer[7]); /* Without this flush, the trackpad may be locked up */ diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c index 06ee639..e8f7fbf 100644 --- a/drivers/net/netconsole.c +++ b/drivers/net/netconsole.c @@ -829,7 +829,7 @@ static void send_ext_msg_udp(struct netconsole_target *nt, const char *msg, } static void write_ext_msg(struct console *con, const char *msg, - unsigned int len) + unsigned int len, unsigned int loglevel) { struct netconsole_target *nt; unsigned long flags; @@ -844,7 +844,8 @@ static void write_ext_msg(struct console *con, const char *msg, spin_unlock_irqrestore(&target_list_lock, flags); } -static void write_msg(struct console *con, const char *msg, unsigned int len) +static void write_msg(struct console *con, const char *msg, unsigned int len, + unsigned int loglevel) { int frag, left; unsigned long flags; diff --git a/drivers/platform/x86/Kconfig b/drivers/platform/x86/Kconfig index 69f93a5..06fa7a3 100644 --- a/drivers/platform/x86/Kconfig +++ b/drivers/platform/x86/Kconfig @@ -490,9 +490,30 @@ config THINKPAD_ACPI_HOTKEY_POLL If you are not sure, say Y here. The driver enables polling only if it is strictly necessary to do so. +config THINKPAD_EC + tristate + depends on X86 + ---help--- + This is a low-level driver for accessing the ThinkPad H8S embedded + controller over the LPC bus (not to be confused with the ACPI Embedded + Controller interface). + +config TP_SMAPI + tristate "ThinkPad SMAPI Support" + depends on X86 + select THINKPAD_EC + default n + help + This adds SMAPI support on Lenovo/IBM ThinkPads, for features such + as battery charging control. For more information about this driver + see . + + If you have a Lenovo/IBM ThinkPad laptop, say Y or M here. + config SENSORS_HDAPS tristate "Thinkpad Hard Drive Active Protection System (hdaps)" depends on INPUT && X86 + select THINKPAD_EC select INPUT_POLLDEV default n help diff --git a/drivers/platform/x86/Makefile b/drivers/platform/x86/Makefile index 40574e7..926ab3e 100644 --- a/drivers/platform/x86/Makefile +++ b/drivers/platform/x86/Makefile @@ -26,6 +26,8 @@ obj-$(CONFIG_TC1100_WMI) += tc1100-wmi.o obj-$(CONFIG_SONY_LAPTOP) += sony-laptop.o obj-$(CONFIG_IDEAPAD_LAPTOP) += ideapad-laptop.o obj-$(CONFIG_THINKPAD_ACPI) += thinkpad_acpi.o +obj-$(CONFIG_THINKPAD_EC) += thinkpad_ec.o +obj-$(CONFIG_TP_SMAPI) += tp_smapi.o obj-$(CONFIG_SENSORS_HDAPS) += hdaps.o obj-$(CONFIG_FUJITSU_LAPTOP) += fujitsu-laptop.o obj-$(CONFIG_FUJITSU_TABLET) += fujitsu-tablet.o diff --git a/drivers/platform/x86/hdaps.c b/drivers/platform/x86/hdaps.c index 458e6c9..1dae4c5 100644 --- a/drivers/platform/x86/hdaps.c +++ b/drivers/platform/x86/hdaps.c @@ -30,266 +30,384 @@ #include #include -#include +#include #include -#include #include #include #include #include -#include - -#define HDAPS_LOW_PORT 0x1600 /* first port used by hdaps */ -#define HDAPS_NR_PORTS 0x30 /* number of ports: 0x1600 - 0x162f */ - -#define HDAPS_PORT_STATE 0x1611 /* device state */ -#define HDAPS_PORT_YPOS 0x1612 /* y-axis position */ -#define HDAPS_PORT_XPOS 0x1614 /* x-axis position */ -#define HDAPS_PORT_TEMP1 0x1616 /* device temperature, in Celsius */ -#define HDAPS_PORT_YVAR 0x1617 /* y-axis variance (what is this?) */ -#define HDAPS_PORT_XVAR 0x1619 /* x-axis variance (what is this?) */ -#define HDAPS_PORT_TEMP2 0x161b /* device temperature (again?) */ -#define HDAPS_PORT_UNKNOWN 0x161c /* what is this? */ -#define HDAPS_PORT_KMACT 0x161d /* keyboard or mouse activity */ - -#define STATE_FRESH 0x50 /* accelerometer data is fresh */ +#include +#include +#include + +/* Embedded controller accelerometer read command and its result: */ +static const struct thinkpad_ec_row ec_accel_args = + { .mask = 0x0001, .val = {0x11} }; +#define EC_ACCEL_IDX_READOUTS 0x1 /* readouts included in this read */ + /* First readout, if READOUTS>=1: */ +#define EC_ACCEL_IDX_YPOS1 0x2 /* y-axis position word */ +#define EC_ACCEL_IDX_XPOS1 0x4 /* x-axis position word */ +#define EC_ACCEL_IDX_TEMP1 0x6 /* device temperature in Celsius */ + /* Second readout, if READOUTS>=2: */ +#define EC_ACCEL_IDX_XPOS2 0x7 /* y-axis position word */ +#define EC_ACCEL_IDX_YPOS2 0x9 /* x-axis position word */ +#define EC_ACCEL_IDX_TEMP2 0xb /* device temperature in Celsius */ +#define EC_ACCEL_IDX_QUEUED 0xc /* Number of queued readouts left */ +#define EC_ACCEL_IDX_KMACT 0xd /* keyboard or mouse activity */ +#define EC_ACCEL_IDX_RETVAL 0xf /* command return value, good=0x00 */ #define KEYBD_MASK 0x20 /* set if keyboard activity */ #define MOUSE_MASK 0x40 /* set if mouse activity */ -#define KEYBD_ISSET(n) (!! (n & KEYBD_MASK)) /* keyboard used? */ -#define MOUSE_ISSET(n) (!! (n & MOUSE_MASK)) /* mouse used? */ -#define INIT_TIMEOUT_MSECS 4000 /* wait up to 4s for device init ... */ -#define INIT_WAIT_MSECS 200 /* ... in 200ms increments */ +#define READ_TIMEOUT_MSECS 100 /* wait this long for device read */ +#define RETRY_MSECS 3 /* retry delay */ -#define HDAPS_POLL_INTERVAL 50 /* poll for input every 1/20s (50 ms)*/ #define HDAPS_INPUT_FUZZ 4 /* input event threshold */ #define HDAPS_INPUT_FLAT 4 - -#define HDAPS_X_AXIS (1 << 0) -#define HDAPS_Y_AXIS (1 << 1) -#define HDAPS_BOTH_AXES (HDAPS_X_AXIS | HDAPS_Y_AXIS) - +#define KMACT_REMEMBER_PERIOD (HZ/10) /* keyboard/mouse persistance */ + +/* Input IDs */ +#define HDAPS_INPUT_VENDOR PCI_VENDOR_ID_IBM +#define HDAPS_INPUT_PRODUCT 0x5054 /* "TP", shared with thinkpad_acpi */ +#define HDAPS_INPUT_JS_VERSION 0x6801 /* Joystick emulation input device */ +#define HDAPS_INPUT_RAW_VERSION 0x4801 /* Raw accelerometer input device */ + +/* Axis orientation. */ +/* The unnatural bit-representation of inversions is for backward + * compatibility with the"invert=1" module parameter. */ +#define HDAPS_ORIENT_INVERT_XY 0x01 /* Invert both X and Y axes. */ +#define HDAPS_ORIENT_INVERT_X 0x02 /* Invert the X axis (uninvert if + * already inverted by INVERT_XY). */ +#define HDAPS_ORIENT_SWAP 0x04 /* Swap the axes. The swap occurs + * before inverting X or Y. */ +#define HDAPS_ORIENT_MAX 0x07 +#define HDAPS_ORIENT_UNDEFINED 0xFF /* Placeholder during initialization */ +#define HDAPS_ORIENT_INVERT_Y (HDAPS_ORIENT_INVERT_XY | HDAPS_ORIENT_INVERT_X) + +static struct timer_list hdaps_timer; static struct platform_device *pdev; -static struct input_polled_dev *hdaps_idev; -static unsigned int hdaps_invert; -static u8 km_activity; -static int rest_x; -static int rest_y; - -static DEFINE_MUTEX(hdaps_mtx); - -/* - * __get_latch - Get the value from a given port. Callers must hold hdaps_mtx. - */ -static inline u8 __get_latch(u16 port) +static struct input_dev *hdaps_idev; /* joystick-like device with fuzz */ +static struct input_dev *hdaps_idev_raw; /* raw hdaps sensor readouts */ +static unsigned int hdaps_invert = HDAPS_ORIENT_UNDEFINED; +static int needs_calibration; + +/* Configuration: */ +static int sampling_rate = 50; /* Sampling rate */ +static int oversampling_ratio = 5; /* Ratio between our sampling rate and + * EC accelerometer sampling rate */ +static int running_avg_filter_order = 2; /* EC running average filter order */ + +/* Latest state readout: */ +static int pos_x, pos_y; /* position */ +static int temperature; /* temperature */ +static int stale_readout = 1; /* last read invalid */ +static int rest_x, rest_y; /* calibrated rest position */ + +/* Last time we saw keyboard and mouse activity: */ +static u64 last_keyboard_jiffies = INITIAL_JIFFIES; +static u64 last_mouse_jiffies = INITIAL_JIFFIES; +static u64 last_update_jiffies = INITIAL_JIFFIES; + +/* input device use count */ +static int hdaps_users; +static DEFINE_MUTEX(hdaps_users_mtx); + +/* Some models require an axis transformation to the standard representation */ +static void transform_axes(int *x, int *y) { - return inb(port) & 0xff; + if (hdaps_invert & HDAPS_ORIENT_SWAP) { + int z; + z = *x; + *x = *y; + *y = z; + } + if (hdaps_invert & HDAPS_ORIENT_INVERT_XY) { + *x = -*x; + *y = -*y; + } + if (hdaps_invert & HDAPS_ORIENT_INVERT_X) + *x = -*x; } -/* - * __check_latch - Check a port latch for a given value. Returns zero if the - * port contains the given value. Callers must hold hdaps_mtx. +/** + * __hdaps_update - query current state, with locks already acquired + * @fast: if nonzero, do one quick attempt without retries. + * + * Query current accelerometer state and update global state variables. + * Also prefetches the next query. Caller must hold controller lock. */ -static inline int __check_latch(u16 port, u8 val) +static int __hdaps_update(int fast) { - if (__get_latch(port) == val) - return 0; - return -EINVAL; -} + /* Read data: */ + struct thinkpad_ec_row data; + int ret; -/* - * __wait_latch - Wait up to 100us for a port latch to get a certain value, - * returning zero if the value is obtained. Callers must hold hdaps_mtx. - */ -static int __wait_latch(u16 port, u8 val) -{ - unsigned int i; + data.mask = (1 << EC_ACCEL_IDX_READOUTS) | (1 << EC_ACCEL_IDX_KMACT) | + (3 << EC_ACCEL_IDX_YPOS1) | (3 << EC_ACCEL_IDX_XPOS1) | + (1 << EC_ACCEL_IDX_TEMP1) | (1 << EC_ACCEL_IDX_RETVAL); + if (fast) + ret = thinkpad_ec_try_read_row(&ec_accel_args, &data); + else + ret = thinkpad_ec_read_row(&ec_accel_args, &data); + thinkpad_ec_prefetch_row(&ec_accel_args); /* Prefetch even if error */ + if (ret) + return ret; - for (i = 0; i < 20; i++) { - if (!__check_latch(port, val)) - return 0; - udelay(5); + /* Check status: */ + if (data.val[EC_ACCEL_IDX_RETVAL] != 0x00) { + pr_warn("read RETVAL=0x%02x\n", + data.val[EC_ACCEL_IDX_RETVAL]); + return -EIO; + } + + if (data.val[EC_ACCEL_IDX_READOUTS] < 1) + return -EBUSY; /* no pending readout, try again later */ + + /* Parse position data: */ + pos_x = *(s16 *)(data.val+EC_ACCEL_IDX_XPOS1); + pos_y = *(s16 *)(data.val+EC_ACCEL_IDX_YPOS1); + transform_axes(&pos_x, &pos_y); + + /* Keyboard and mouse activity status is cleared as soon as it's read, + * so applications will eat each other's events. Thus we remember any + * event for KMACT_REMEMBER_PERIOD jiffies. + */ + if (data.val[EC_ACCEL_IDX_KMACT] & KEYBD_MASK) + last_keyboard_jiffies = get_jiffies_64(); + if (data.val[EC_ACCEL_IDX_KMACT] & MOUSE_MASK) + last_mouse_jiffies = get_jiffies_64(); + + temperature = data.val[EC_ACCEL_IDX_TEMP1]; + + last_update_jiffies = get_jiffies_64(); + stale_readout = 0; + if (needs_calibration) { + rest_x = pos_x; + rest_y = pos_y; + needs_calibration = 0; } - return -EIO; + return 0; } -/* - * __device_refresh - request a refresh from the accelerometer. Does not wait - * for refresh to complete. Callers must hold hdaps_mtx. +/** + * hdaps_update - acquire locks and query current state + * + * Query current accelerometer state and update global state variables. + * Also prefetches the next query. + * Retries until timeout if the accelerometer is not in ready status (common). + * Does its own locking. */ -static void __device_refresh(void) +static int hdaps_update(void) { - udelay(200); - if (inb(0x1604) != STATE_FRESH) { - outb(0x11, 0x1610); - outb(0x01, 0x161f); + u64 age = get_jiffies_64() - last_update_jiffies; + int total, ret; + + if (!stale_readout && age < (9*HZ)/(10*sampling_rate)) + return 0; /* already updated recently */ + for (total = 0; total < READ_TIMEOUT_MSECS; total += RETRY_MSECS) { + ret = thinkpad_ec_lock(); + if (ret) + return ret; + ret = __hdaps_update(0); + thinkpad_ec_unlock(); + + if (!ret) + return 0; + if (ret != -EBUSY) + break; + msleep(RETRY_MSECS); } + return ret; } -/* - * __device_refresh_sync - request a synchronous refresh from the - * accelerometer. We wait for the refresh to complete. Returns zero if - * successful and nonzero on error. Callers must hold hdaps_mtx. +/** + * hdaps_set_power - enable or disable power to the accelerometer. + * Returns zero on success and negative error code on failure. Can sleep. */ -static int __device_refresh_sync(void) +static int hdaps_set_power(int on) { - __device_refresh(); - return __wait_latch(0x1604, STATE_FRESH); + struct thinkpad_ec_row args = + { .mask = 0x0003, .val = {0x14, on?0x01:0x00} }; + struct thinkpad_ec_row data = { .mask = 0x8000 }; + int ret = thinkpad_ec_read_row(&args, &data); + if (ret) + return ret; + if (data.val[0xF] != 0x00) + return -EIO; + return 0; } -/* - * __device_complete - indicate to the accelerometer that we are done reading - * data, and then initiate an async refresh. Callers must hold hdaps_mtx. +/** + * hdaps_set_ec_config - set accelerometer parameters. + * @ec_rate: embedded controller sampling rate + * @order: embedded controller running average filter order + * (Normally we have @ec_rate = sampling_rate * oversampling_ratio.) + * Returns zero on success and negative error code on failure. Can sleep. */ -static inline void __device_complete(void) +static int hdaps_set_ec_config(int ec_rate, int order) { - inb(0x161f); - inb(0x1604); - __device_refresh(); + struct thinkpad_ec_row args = { .mask = 0x000F, + .val = {0x10, (u8)ec_rate, (u8)(ec_rate>>8), order} }; + struct thinkpad_ec_row data = { .mask = 0x8000 }; + int ret = thinkpad_ec_read_row(&args, &data); + pr_debug("setting ec_rate=%d, filter_order=%d\n", ec_rate, order); + if (ret) + return ret; + if (data.val[0xF] == 0x03) { + pr_warn("config param out of range\n"); + return -EINVAL; + } + if (data.val[0xF] == 0x06) { + pr_warn("config change already pending\n"); + return -EBUSY; + } + if (data.val[0xF] != 0x00) { + pr_warn("config change error, ret=%d\n", + data.val[0xF]); + return -EIO; + } + return 0; } -/* - * hdaps_readb_one - reads a byte from a single I/O port, placing the value in - * the given pointer. Returns zero on success or a negative error on failure. - * Can sleep. +/** + * hdaps_get_ec_config - get accelerometer parameters. + * @ec_rate: embedded controller sampling rate + * @order: embedded controller running average filter order + * Returns zero on success and negative error code on failure. Can sleep. */ -static int hdaps_readb_one(unsigned int port, u8 *val) +static int hdaps_get_ec_config(int *ec_rate, int *order) { - int ret; - - mutex_lock(&hdaps_mtx); - - /* do a sync refresh -- we need to be sure that we read fresh data */ - ret = __device_refresh_sync(); + const struct thinkpad_ec_row args = + { .mask = 0x0003, .val = {0x17, 0x82} }; + struct thinkpad_ec_row data = { .mask = 0x801F }; + int ret = thinkpad_ec_read_row(&args, &data); if (ret) - goto out; - - *val = inb(port); - __device_complete(); - -out: - mutex_unlock(&hdaps_mtx); - return ret; + return ret; + if (data.val[0xF] != 0x00) + return -EIO; + if (!(data.val[0x1] & 0x01)) + return -ENXIO; /* accelerometer polling not enabled */ + if (data.val[0x1] & 0x02) + return -EBUSY; /* config change in progress, retry later */ + *ec_rate = data.val[0x2] | ((int)(data.val[0x3]) << 8); + *order = data.val[0x4]; + return 0; } -/* __hdaps_read_pair - internal lockless helper for hdaps_read_pair(). */ -static int __hdaps_read_pair(unsigned int port1, unsigned int port2, - int *x, int *y) +/** + * hdaps_get_ec_mode - get EC accelerometer mode + * Returns zero on success and negative error code on failure. Can sleep. + */ +static int hdaps_get_ec_mode(u8 *mode) { - /* do a sync refresh -- we need to be sure that we read fresh data */ - if (__device_refresh_sync()) + const struct thinkpad_ec_row args = + { .mask = 0x0001, .val = {0x13} }; + struct thinkpad_ec_row data = { .mask = 0x8002 }; + int ret = thinkpad_ec_read_row(&args, &data); + if (ret) + return ret; + if (data.val[0xF] != 0x00) { + pr_warn("accelerometer not implemented (0x%02x)\n", + data.val[0xF]); return -EIO; - - *y = inw(port2); - *x = inw(port1); - km_activity = inb(HDAPS_PORT_KMACT); - __device_complete(); - - /* hdaps_invert is a bitvector to negate the axes */ - if (hdaps_invert & HDAPS_X_AXIS) - *x = -*x; - if (hdaps_invert & HDAPS_Y_AXIS) - *y = -*y; - + } + *mode = data.val[0x1]; return 0; } -/* - * hdaps_read_pair - reads the values from a pair of ports, placing the values - * in the given pointers. Returns zero on success. Can sleep. +/** + * hdaps_check_ec - checks something about the EC. + * Follows the clean-room spec for HDAPS; we don't know what it means. + * Returns zero on success and negative error code on failure. Can sleep. */ -static int hdaps_read_pair(unsigned int port1, unsigned int port2, - int *val1, int *val2) +static int hdaps_check_ec(void) { - int ret; - - mutex_lock(&hdaps_mtx); - ret = __hdaps_read_pair(port1, port2, val1, val2); - mutex_unlock(&hdaps_mtx); - - return ret; + const struct thinkpad_ec_row args = + { .mask = 0x0003, .val = {0x17, 0x81} }; + struct thinkpad_ec_row data = { .mask = 0x800E }; + int ret = thinkpad_ec_read_row(&args, &data); + if (ret) + return ret; + if (!((data.val[0x1] == 0x00 && data.val[0x2] == 0x60) || /* cleanroom spec */ + (data.val[0x1] == 0x01 && data.val[0x2] == 0x00)) || /* seen on T61 */ + data.val[0x3] != 0x00 || data.val[0xF] != 0x00) { + pr_warn("hdaps_check_ec: bad response (0x%x,0x%x,0x%x,0x%x)\n", + data.val[0x1], data.val[0x2], + data.val[0x3], data.val[0xF]); + return -EIO; + } + return 0; } -/* - * hdaps_device_init - initialize the accelerometer. Returns zero on success - * and negative error code on failure. Can sleep. +/** + * hdaps_device_init - initialize the accelerometer. + * + * Call several embedded controller functions to test and initialize the + * accelerometer. + * Returns zero on success and negative error code on failure. Can sleep. */ +#define FAILED_INIT(msg) pr_err("init failed at: %s\n", msg) static int hdaps_device_init(void) { - int total, ret = -ENXIO; + int ret; + u8 mode; - mutex_lock(&hdaps_mtx); + ret = thinkpad_ec_lock(); + if (ret) + return ret; - outb(0x13, 0x1610); - outb(0x01, 0x161f); - if (__wait_latch(0x161f, 0x00)) - goto out; + if (hdaps_get_ec_mode(&mode)) + { FAILED_INIT("hdaps_get_ec_mode failed"); goto bad; } - /* - * Most ThinkPads return 0x01. - * - * Others--namely the R50p, T41p, and T42p--return 0x03. These laptops - * have "inverted" axises. - * - * The 0x02 value occurs when the chip has been previously initialized. - */ - if (__check_latch(0x1611, 0x03) && - __check_latch(0x1611, 0x02) && - __check_latch(0x1611, 0x01)) - goto out; - - printk(KERN_DEBUG "hdaps: initial latch check good (0x%02x)\n", - __get_latch(0x1611)); + pr_debug("initial mode latch is 0x%02x\n", mode); + if (mode == 0x00) + { FAILED_INIT("accelerometer not available"); goto bad; } - outb(0x17, 0x1610); - outb(0x81, 0x1611); - outb(0x01, 0x161f); - if (__wait_latch(0x161f, 0x00)) - goto out; - if (__wait_latch(0x1611, 0x00)) - goto out; - if (__wait_latch(0x1612, 0x60)) - goto out; - if (__wait_latch(0x1613, 0x00)) - goto out; - outb(0x14, 0x1610); - outb(0x01, 0x1611); - outb(0x01, 0x161f); - if (__wait_latch(0x161f, 0x00)) - goto out; - outb(0x10, 0x1610); - outb(0xc8, 0x1611); - outb(0x00, 0x1612); - outb(0x02, 0x1613); - outb(0x01, 0x161f); - if (__wait_latch(0x161f, 0x00)) - goto out; - if (__device_refresh_sync()) - goto out; - if (__wait_latch(0x1611, 0x00)) - goto out; + if (hdaps_check_ec()) + { FAILED_INIT("hdaps_check_ec failed"); goto bad; } - /* we have done our dance, now let's wait for the applause */ - for (total = INIT_TIMEOUT_MSECS; total > 0; total -= INIT_WAIT_MSECS) { - int x, y; + if (hdaps_set_power(1)) + { FAILED_INIT("hdaps_set_power failed"); goto bad; } - /* a read of the device helps push it into action */ - __hdaps_read_pair(HDAPS_PORT_XPOS, HDAPS_PORT_YPOS, &x, &y); - if (!__wait_latch(0x1611, 0x02)) { - ret = 0; - break; - } + if (hdaps_set_ec_config(sampling_rate*oversampling_ratio, + running_avg_filter_order)) + { FAILED_INIT("hdaps_set_ec_config failed"); goto bad; } - msleep(INIT_WAIT_MSECS); - } + thinkpad_ec_invalidate(); + udelay(200); -out: - mutex_unlock(&hdaps_mtx); + /* Just prefetch instead of reading, to avoid ~1sec delay on load */ + ret = thinkpad_ec_prefetch_row(&ec_accel_args); + if (ret) + { FAILED_INIT("initial prefetch failed"); goto bad; } + goto good; +bad: + thinkpad_ec_invalidate(); + ret = -ENXIO; +good: + stale_readout = 1; + thinkpad_ec_unlock(); return ret; } +/** + * hdaps_device_shutdown - power off the accelerometer + * Returns nonzero on failure. Can sleep. + */ +static int hdaps_device_shutdown(void) +{ + int ret; + ret = hdaps_set_power(0); + if (ret) { + pr_warn("cannot power off\n"); + return ret; + } + ret = hdaps_set_ec_config(0, 1); + if (ret) + pr_warn("cannot stop EC sampling\n"); + return ret; +} /* Device model stuff */ @@ -306,13 +424,29 @@ static int hdaps_probe(struct platform_device *dev) } #ifdef CONFIG_PM_SLEEP +static int hdaps_suspend(struct device *dev) +{ + /* Don't do hdaps polls until resume re-initializes the sensor. */ + del_timer_sync(&hdaps_timer); + hdaps_device_shutdown(); /* ignore errors, effect is negligible */ + return 0; +} + static int hdaps_resume(struct device *dev) { - return hdaps_device_init(); + int ret = hdaps_device_init(); + if (ret) + return ret; + + mutex_lock(&hdaps_users_mtx); + if (hdaps_users) + mod_timer(&hdaps_timer, jiffies + HZ/sampling_rate); + mutex_unlock(&hdaps_users_mtx); + return 0; } #endif -static SIMPLE_DEV_PM_OPS(hdaps_pm, NULL, hdaps_resume); +static SIMPLE_DEV_PM_OPS(hdaps_pm, hdaps_suspend, hdaps_resume); static struct platform_driver hdaps_driver = { .probe = hdaps_probe, @@ -322,30 +456,47 @@ static struct platform_driver hdaps_driver = { }, }; -/* - * hdaps_calibrate - Set our "resting" values. Callers must hold hdaps_mtx. +/** + * hdaps_calibrate - set our "resting" values. + * Does its own locking. */ static void hdaps_calibrate(void) { - __hdaps_read_pair(HDAPS_PORT_XPOS, HDAPS_PORT_YPOS, &rest_x, &rest_y); + needs_calibration = 1; + hdaps_update(); + /* If that fails, the mousedev poll will take care of things later. */ } -static void hdaps_mousedev_poll(struct input_polled_dev *dev) +/* Timer handler for updating the input device. Runs in softirq context, + * so avoid lenghty or blocking operations. + */ +static void hdaps_mousedev_poll(unsigned long unused) { - struct input_dev *input_dev = dev->input; - int x, y; + int ret; - mutex_lock(&hdaps_mtx); + stale_readout = 1; - if (__hdaps_read_pair(HDAPS_PORT_XPOS, HDAPS_PORT_YPOS, &x, &y)) - goto out; + /* Cannot sleep. Try nonblockingly. If we fail, try again later. */ + if (thinkpad_ec_try_lock()) + goto keep_active; - input_report_abs(input_dev, ABS_X, x - rest_x); - input_report_abs(input_dev, ABS_Y, y - rest_y); - input_sync(input_dev); + ret = __hdaps_update(1); /* fast update, we're in softirq context */ + thinkpad_ec_unlock(); + /* Any of "successful", "not yet ready" and "not prefetched"? */ + if (ret != 0 && ret != -EBUSY && ret != -ENODATA) { + pr_err("poll failed, disabling updates\n"); + return; + } -out: - mutex_unlock(&hdaps_mtx); +keep_active: + /* Even if we failed now, pos_x,y may have been updated earlier: */ + input_report_abs(hdaps_idev, ABS_X, pos_x - rest_x); + input_report_abs(hdaps_idev, ABS_Y, pos_y - rest_y); + input_sync(hdaps_idev); + input_report_abs(hdaps_idev_raw, ABS_X, pos_x); + input_report_abs(hdaps_idev_raw, ABS_Y, pos_y); + input_sync(hdaps_idev_raw); + mod_timer(&hdaps_timer, jiffies + HZ/sampling_rate); } @@ -354,65 +505,41 @@ out: static ssize_t hdaps_position_show(struct device *dev, struct device_attribute *attr, char *buf) { - int ret, x, y; - - ret = hdaps_read_pair(HDAPS_PORT_XPOS, HDAPS_PORT_YPOS, &x, &y); - if (ret) - return ret; - - return sprintf(buf, "(%d,%d)\n", x, y); -} - -static ssize_t hdaps_variance_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - int ret, x, y; - - ret = hdaps_read_pair(HDAPS_PORT_XVAR, HDAPS_PORT_YVAR, &x, &y); + int ret = hdaps_update(); if (ret) return ret; - - return sprintf(buf, "(%d,%d)\n", x, y); + return sprintf(buf, "(%d,%d)\n", pos_x, pos_y); } static ssize_t hdaps_temp1_show(struct device *dev, struct device_attribute *attr, char *buf) { - u8 uninitialized_var(temp); - int ret; - - ret = hdaps_readb_one(HDAPS_PORT_TEMP1, &temp); - if (ret) - return ret; - - return sprintf(buf, "%u\n", temp); -} - -static ssize_t hdaps_temp2_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - u8 uninitialized_var(temp); - int ret; - - ret = hdaps_readb_one(HDAPS_PORT_TEMP2, &temp); + int ret = hdaps_update(); if (ret) return ret; - - return sprintf(buf, "%u\n", temp); + return sprintf(buf, "%d\n", temperature); } static ssize_t hdaps_keyboard_activity_show(struct device *dev, struct device_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", KEYBD_ISSET(km_activity)); + int ret = hdaps_update(); + if (ret) + return ret; + return sprintf(buf, "%u\n", + get_jiffies_64() < last_keyboard_jiffies + KMACT_REMEMBER_PERIOD); } static ssize_t hdaps_mouse_activity_show(struct device *dev, struct device_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", MOUSE_ISSET(km_activity)); + int ret = hdaps_update(); + if (ret) + return ret; + return sprintf(buf, "%u\n", + get_jiffies_64() < last_mouse_jiffies + KMACT_REMEMBER_PERIOD); } static ssize_t hdaps_calibrate_show(struct device *dev, @@ -425,10 +552,7 @@ static ssize_t hdaps_calibrate_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { - mutex_lock(&hdaps_mtx); hdaps_calibrate(); - mutex_unlock(&hdaps_mtx); - return count; } @@ -445,7 +569,7 @@ static ssize_t hdaps_invert_store(struct device *dev, int invert; if (sscanf(buf, "%d", &invert) != 1 || - invert < 0 || invert > HDAPS_BOTH_AXES) + invert < 0 || invert > HDAPS_ORIENT_MAX) return -EINVAL; hdaps_invert = invert; @@ -454,24 +578,128 @@ static ssize_t hdaps_invert_store(struct device *dev, return count; } +static ssize_t hdaps_sampling_rate_show( + struct device *dev, struct device_attribute *attr, char *buf) +{ + return sprintf(buf, "%d\n", sampling_rate); +} + +static ssize_t hdaps_sampling_rate_store( + struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + int rate, ret; + if (sscanf(buf, "%d", &rate) != 1 || rate > HZ || rate <= 0) { + pr_warn("must have 0ident); - return 1; -} - /* hdaps_dmi_match_invert - found an inverted match. */ static int __init hdaps_dmi_match_invert(const struct dmi_system_id *id) { - hdaps_invert = (unsigned long)id->driver_data; - pr_info("inverting axis (%u) readings\n", hdaps_invert); - return hdaps_dmi_match(id); + unsigned int orient = (kernel_ulong_t) id->driver_data; + hdaps_invert = orient; + pr_info("%s detected, setting orientation %u\n", id->ident, orient); + return 1; /* stop enumeration */ } -#define HDAPS_DMI_MATCH_INVERT(vendor, model, axes) { \ +#define HDAPS_DMI_MATCH_INVERT(vendor, model, orient) { \ .ident = vendor " " model, \ .callback = hdaps_dmi_match_invert, \ - .driver_data = (void *)axes, \ + .driver_data = (void *)(orient), \ .matches = { \ DMI_MATCH(DMI_BOARD_VENDOR, vendor), \ DMI_MATCH(DMI_PRODUCT_VERSION, model) \ } \ } -#define HDAPS_DMI_MATCH_NORMAL(vendor, model) \ - HDAPS_DMI_MATCH_INVERT(vendor, model, 0) - -/* Note that HDAPS_DMI_MATCH_NORMAL("ThinkPad T42") would match - "ThinkPad T42p", so the order of the entries matters. - If your ThinkPad is not recognized, please update to latest - BIOS. This is especially the case for some R52 ThinkPads. */ -static struct dmi_system_id __initdata hdaps_whitelist[] = { - HDAPS_DMI_MATCH_INVERT("IBM", "ThinkPad R50p", HDAPS_BOTH_AXES), - HDAPS_DMI_MATCH_NORMAL("IBM", "ThinkPad R50"), - HDAPS_DMI_MATCH_NORMAL("IBM", "ThinkPad R51"), - HDAPS_DMI_MATCH_NORMAL("IBM", "ThinkPad R52"), - HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad R61i", HDAPS_BOTH_AXES), - HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad R61", HDAPS_BOTH_AXES), - HDAPS_DMI_MATCH_INVERT("IBM", "ThinkPad T41p", HDAPS_BOTH_AXES), - HDAPS_DMI_MATCH_NORMAL("IBM", "ThinkPad T41"), - HDAPS_DMI_MATCH_INVERT("IBM", "ThinkPad T42p", HDAPS_BOTH_AXES), - HDAPS_DMI_MATCH_NORMAL("IBM", "ThinkPad T42"), - HDAPS_DMI_MATCH_NORMAL("IBM", "ThinkPad T43"), - HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad T400", HDAPS_BOTH_AXES), - HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad T60", HDAPS_BOTH_AXES), - HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad T61p", HDAPS_BOTH_AXES), - HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad T61", HDAPS_BOTH_AXES), - HDAPS_DMI_MATCH_NORMAL("IBM", "ThinkPad X40"), - HDAPS_DMI_MATCH_INVERT("IBM", "ThinkPad X41", HDAPS_Y_AXIS), - HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad X60", HDAPS_BOTH_AXES), - HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad X61s", HDAPS_BOTH_AXES), - HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad X61", HDAPS_BOTH_AXES), - HDAPS_DMI_MATCH_NORMAL("IBM", "ThinkPad Z60m"), - HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad Z61m", HDAPS_BOTH_AXES), - HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad Z61p", HDAPS_BOTH_AXES), +/* List of models with abnormal axis configuration. + Note that HDAPS_DMI_MATCH_NORMAL("ThinkPad T42") would match + "ThinkPad T42p", and enumeration stops after first match, + so the order of the entries matters. */ +struct dmi_system_id __initdata hdaps_whitelist[] = { + HDAPS_DMI_MATCH_INVERT("IBM", "ThinkPad R50p", HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("IBM", "ThinkPad R60", HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("IBM", "ThinkPad T41p", HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("IBM", "ThinkPad T42p", HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("IBM", "ThinkPad X40", HDAPS_ORIENT_INVERT_Y), + HDAPS_DMI_MATCH_INVERT("IBM", "ThinkPad X41", HDAPS_ORIENT_INVERT_Y), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad R60", HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad R61", HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad R400", HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad R500", HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad T60", HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad T61", HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad X60 Tablet", HDAPS_ORIENT_INVERT_Y), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad X60s", HDAPS_ORIENT_INVERT_Y), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad X60", HDAPS_ORIENT_SWAP | HDAPS_ORIENT_INVERT_X), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad X61", HDAPS_ORIENT_SWAP | HDAPS_ORIENT_INVERT_X), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad T400s", HDAPS_ORIENT_INVERT_X), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad T400", HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad T410s", HDAPS_ORIENT_SWAP), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad T410", HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad T500", HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad T510", HDAPS_ORIENT_SWAP | HDAPS_ORIENT_INVERT_X | HDAPS_ORIENT_INVERT_Y), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad W51O", HDAPS_ORIENT_MAX), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad X200s", HDAPS_ORIENT_SWAP | HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad X200", HDAPS_ORIENT_SWAP | HDAPS_ORIENT_INVERT_X | HDAPS_ORIENT_INVERT_Y), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad X201 Tablet", HDAPS_ORIENT_SWAP | HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad X201s", HDAPS_ORIENT_SWAP | HDAPS_ORIENT_INVERT_XY), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad X201", HDAPS_ORIENT_SWAP | HDAPS_ORIENT_INVERT_X), + HDAPS_DMI_MATCH_INVERT("LENOVO", "ThinkPad X220", HDAPS_ORIENT_SWAP), { .ident = NULL } }; static int __init hdaps_init(void) { - struct input_dev *idev; int ret; - if (!dmi_check_system(hdaps_whitelist)) { - pr_warn("supported laptop not found!\n"); - ret = -ENODEV; - goto out; - } - - if (!request_region(HDAPS_LOW_PORT, HDAPS_NR_PORTS, "hdaps")) { - ret = -ENXIO; - goto out; - } + /* Determine axis orientation orientation */ + if (hdaps_invert == HDAPS_ORIENT_UNDEFINED) /* set by module param? */ + if (dmi_check_system(hdaps_whitelist) < 1) /* in whitelist? */ + hdaps_invert = 0; /* default */ + /* Init timer before platform_driver_register, in case of suspend */ + init_timer(&hdaps_timer); + hdaps_timer.function = hdaps_mousedev_poll; ret = platform_driver_register(&hdaps_driver); if (ret) - goto out_region; + goto out; pdev = platform_device_register_simple("hdaps", -1, NULL, 0); if (IS_ERR(pdev)) { @@ -571,47 +792,79 @@ static int __init hdaps_init(void) if (ret) goto out_device; - hdaps_idev = input_allocate_polled_device(); + hdaps_idev = input_allocate_device(); if (!hdaps_idev) { ret = -ENOMEM; goto out_group; } - hdaps_idev->poll = hdaps_mousedev_poll; - hdaps_idev->poll_interval = HDAPS_POLL_INTERVAL; - - /* initial calibrate for the input device */ - hdaps_calibrate(); + hdaps_idev_raw = input_allocate_device(); + if (!hdaps_idev_raw) { + ret = -ENOMEM; + goto out_idev_first; + } - /* initialize the input class */ - idev = hdaps_idev->input; - idev->name = "hdaps"; - idev->phys = "isa1600/input0"; - idev->id.bustype = BUS_ISA; - idev->dev.parent = &pdev->dev; - idev->evbit[0] = BIT_MASK(EV_ABS); - input_set_abs_params(idev, ABS_X, + /* calibration for the input device (deferred to avoid delay) */ + needs_calibration = 1; + + /* initialize the joystick-like fuzzed input device */ + hdaps_idev->name = "ThinkPad HDAPS joystick emulation"; + hdaps_idev->phys = "hdaps/input0"; + hdaps_idev->id.bustype = BUS_HOST; + hdaps_idev->id.vendor = HDAPS_INPUT_VENDOR; + hdaps_idev->id.product = HDAPS_INPUT_PRODUCT; + hdaps_idev->id.version = HDAPS_INPUT_JS_VERSION; +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,25) + hdaps_idev->cdev.dev = &pdev->dev; +#endif + hdaps_idev->evbit[0] = BIT(EV_ABS); + hdaps_idev->open = hdaps_mousedev_open; + hdaps_idev->close = hdaps_mousedev_close; + input_set_abs_params(hdaps_idev, ABS_X, -256, 256, HDAPS_INPUT_FUZZ, HDAPS_INPUT_FLAT); - input_set_abs_params(idev, ABS_Y, + input_set_abs_params(hdaps_idev, ABS_Y, -256, 256, HDAPS_INPUT_FUZZ, HDAPS_INPUT_FLAT); - ret = input_register_polled_device(hdaps_idev); + ret = input_register_device(hdaps_idev); if (ret) goto out_idev; - pr_info("driver successfully loaded\n"); + /* initialize the raw data input device */ + hdaps_idev_raw->name = "ThinkPad HDAPS accelerometer data"; + hdaps_idev_raw->phys = "hdaps/input1"; + hdaps_idev_raw->id.bustype = BUS_HOST; + hdaps_idev_raw->id.vendor = HDAPS_INPUT_VENDOR; + hdaps_idev_raw->id.product = HDAPS_INPUT_PRODUCT; + hdaps_idev_raw->id.version = HDAPS_INPUT_RAW_VERSION; +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,25) + hdaps_idev_raw->cdev.dev = &pdev->dev; +#endif + hdaps_idev_raw->evbit[0] = BIT(EV_ABS); + hdaps_idev_raw->open = hdaps_mousedev_open; + hdaps_idev_raw->close = hdaps_mousedev_close; + input_set_abs_params(hdaps_idev_raw, ABS_X, -32768, 32767, 0, 0); + input_set_abs_params(hdaps_idev_raw, ABS_Y, -32768, 32767, 0, 0); + + ret = input_register_device(hdaps_idev_raw); + if (ret) + goto out_idev_reg_first; + + pr_info("driver successfully loaded.\n"); return 0; +out_idev_reg_first: + input_unregister_device(hdaps_idev); out_idev: - input_free_polled_device(hdaps_idev); + input_free_device(hdaps_idev_raw); +out_idev_first: + input_free_device(hdaps_idev); out_group: sysfs_remove_group(&pdev->dev.kobj, &hdaps_attribute_group); out_device: platform_device_unregister(pdev); out_driver: platform_driver_unregister(&hdaps_driver); -out_region: - release_region(HDAPS_LOW_PORT, HDAPS_NR_PORTS); + hdaps_device_shutdown(); out: pr_warn("driver init failed (ret=%d)!\n", ret); return ret; @@ -619,12 +872,12 @@ out: static void __exit hdaps_exit(void) { - input_unregister_polled_device(hdaps_idev); - input_free_polled_device(hdaps_idev); + input_unregister_device(hdaps_idev_raw); + input_unregister_device(hdaps_idev); + hdaps_device_shutdown(); /* ignore errors, effect is negligible */ sysfs_remove_group(&pdev->dev.kobj, &hdaps_attribute_group); platform_device_unregister(pdev); platform_driver_unregister(&hdaps_driver); - release_region(HDAPS_LOW_PORT, HDAPS_NR_PORTS); pr_info("driver unloaded\n"); } @@ -632,9 +885,8 @@ static void __exit hdaps_exit(void) module_init(hdaps_init); module_exit(hdaps_exit); -module_param_named(invert, hdaps_invert, int, 0); -MODULE_PARM_DESC(invert, "invert data along each axis. 1 invert x-axis, " - "2 invert y-axis, 3 invert both axes."); +module_param_named(invert, hdaps_invert, uint, 0); +MODULE_PARM_DESC(invert, "axis orientation code"); MODULE_AUTHOR("Robert Love"); MODULE_DESCRIPTION("IBM Hard Drive Active Protection System (HDAPS) driver"); diff --git b/drivers/platform/x86/thinkpad_ec.c b/drivers/platform/x86/thinkpad_ec.c new file mode 100644 index 0000000..ca34fed --- /dev/null +++ b/drivers/platform/x86/thinkpad_ec.c @@ -0,0 +1,513 @@ +/* + * thinkpad_ec.c - ThinkPad embedded controller LPC3 functions + * + * The embedded controller on ThinkPad laptops has a non-standard interface, + * where LPC channel 3 of the H8S EC chip is hooked up to IO ports + * 0x1600-0x161F and implements (a special case of) the H8S LPC protocol. + * The EC LPC interface provides various system management services (currently + * known: battery information and accelerometer readouts). This driver + * provides access and mutual exclusion for the EC interface. +* + * The LPC protocol and terminology are documented here: + * "H8S/2104B Group Hardware Manual", + * http://documentation.renesas.com/eng/products/mpumcu/rej09b0300_2140bhm.pdf + * + * Copyright (C) 2006-2007 Shem Multinymous + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,26) + #include +#else + #include +#endif + +#define TP_VERSION "0.41" + +MODULE_AUTHOR("Shem Multinymous"); +MODULE_DESCRIPTION("ThinkPad embedded controller hardware access"); +MODULE_VERSION(TP_VERSION); +MODULE_LICENSE("GPL"); + +/* IO ports used by embedded controller LPC channel 3: */ +#define TPC_BASE_PORT 0x1600 +#define TPC_NUM_PORTS 0x20 +#define TPC_STR3_PORT 0x1604 /* Reads H8S EC register STR3 */ +#define TPC_TWR0_PORT 0x1610 /* Mapped to H8S EC register TWR0MW/SW */ +#define TPC_TWR15_PORT 0x161F /* Mapped to H8S EC register TWR15. */ + /* (and port TPC_TWR0_PORT+i is mapped to H8S reg TWRi for 00x%02x", \ + msg, args->val[0x0], args->val[0xF], code) + +/* State of request prefetching: */ +static u8 prefetch_arg0, prefetch_argF; /* Args of last prefetch */ +static u64 prefetch_jiffies; /* time of prefetch, or: */ +#define TPC_PREFETCH_NONE INITIAL_JIFFIES /* No prefetch */ +#define TPC_PREFETCH_JUNK (INITIAL_JIFFIES+1) /* Ignore prefetch */ + +/* Locking: */ +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,37) +static DECLARE_MUTEX(thinkpad_ec_mutex); +#else +static DEFINE_SEMAPHORE(thinkpad_ec_mutex); +#endif + +/* Kludge in case the ACPI DSDT reserves the ports we need. */ +static bool force_io; /* Willing to do IO to ports we couldn't reserve? */ +static int reserved_io; /* Successfully reserved the ports? */ +module_param_named(force_io, force_io, bool, 0600); +MODULE_PARM_DESC(force_io, "Force IO even if region already reserved (0=off, 1=on)"); + +/** + * thinkpad_ec_lock - get lock on the ThinkPad EC + * + * Get exclusive lock for accesing the ThinkPad embedded controller LPC3 + * interface. Returns 0 iff lock acquired. + */ +int thinkpad_ec_lock(void) +{ + int ret; + ret = down_interruptible(&thinkpad_ec_mutex); + return ret; +} +EXPORT_SYMBOL_GPL(thinkpad_ec_lock); + +/** + * thinkpad_ec_try_lock - try getting lock on the ThinkPad EC + * + * Try getting an exclusive lock for accesing the ThinkPad embedded + * controller LPC3. Returns immediately if lock is not available; neither + * blocks nor sleeps. Returns 0 iff lock acquired . + */ +int thinkpad_ec_try_lock(void) +{ + return down_trylock(&thinkpad_ec_mutex); +} +EXPORT_SYMBOL_GPL(thinkpad_ec_try_lock); + +/** + * thinkpad_ec_unlock - release lock on ThinkPad EC + * + * Release a previously acquired exclusive lock on the ThinkPad ebmedded + * controller LPC3 interface. + */ +void thinkpad_ec_unlock(void) +{ + up(&thinkpad_ec_mutex); +} +EXPORT_SYMBOL_GPL(thinkpad_ec_unlock); + +/** + * thinkpad_ec_request_row - tell embedded controller to prepare a row + * @args Input register arguments + * + * Requests a data row by writing to H8S LPC registers TRW0 through TWR15 (or + * a subset thereof) following the protocol prescribed by the "H8S/2104B Group + * Hardware Manual". Does sanity checks via status register STR3. + */ +static int thinkpad_ec_request_row(const struct thinkpad_ec_row *args) +{ + u8 str3; + int i; + + /* EC protocol requires write to TWR0 (function code): */ + if (!(args->mask & 0x0001)) { + printk(KERN_ERR MSG_FMT("bad args->mask=0x%02x", args->mask)); + return -EINVAL; + } + + /* Check initial STR3 status: */ + str3 = inb(TPC_STR3_PORT) & H8S_STR3_MASK; + if (str3 & H8S_STR3_OBF3B) { /* data already pending */ + inb(TPC_TWR15_PORT); /* marks end of previous transaction */ + if (prefetch_jiffies == TPC_PREFETCH_NONE) + printk(KERN_WARNING REQ_FMT( + "EC has result from unrequested transaction", + str3)); + return -EBUSY; /* EC will be ready in a few usecs */ + } else if (str3 == H8S_STR3_SWMF) { /* busy with previous request */ + if (prefetch_jiffies == TPC_PREFETCH_NONE) + printk(KERN_WARNING REQ_FMT( + "EC is busy with unrequested transaction", + str3)); + return -EBUSY; /* data will be pending in a few usecs */ + } else if (str3 != 0x00) { /* unexpected status? */ + printk(KERN_WARNING REQ_FMT("unexpected initial STR3", str3)); + return -EIO; + } + + /* Send TWR0MW: */ + outb(args->val[0], TPC_TWR0_PORT); + str3 = inb(TPC_STR3_PORT) & H8S_STR3_MASK; + if (str3 != H8S_STR3_MWMF) { /* not accepted? */ + printk(KERN_WARNING REQ_FMT("arg0 rejected", str3)); + return -EIO; + } + + /* Send TWR1 through TWR14: */ + for (i = 1; i < TP_CONTROLLER_ROW_LEN-1; i++) + if ((args->mask>>i)&1) + outb(args->val[i], TPC_TWR0_PORT+i); + + /* Send TWR15 (default to 0x01). This marks end of command. */ + outb((args->mask & 0x8000) ? args->val[0xF] : 0x01, TPC_TWR15_PORT); + + /* Wait until EC starts writing its reply (~60ns on average). + * Releasing locks before this happens may cause an EC hang + * due to firmware bug! + */ + for (i = 0; i < TPC_REQUEST_RETRIES; i++) { + str3 = inb(TPC_STR3_PORT) & H8S_STR3_MASK; + if (str3 & H8S_STR3_SWMF) /* EC started replying */ + return 0; + else if (!(str3 & ~(H8S_STR3_IBF3B|H8S_STR3_MWMF))) + /* Normal progress (the EC hasn't seen the request + * yet, or is processing it). Wait it out. */ + ndelay(TPC_REQUEST_NDELAY); + else { /* weird EC status */ + printk(KERN_WARNING + REQ_FMT("bad end STR3", str3)); + return -EIO; + } + } + printk(KERN_WARNING REQ_FMT("EC is mysteriously silent", str3)); + return -EIO; +} + +/** + * thinkpad_ec_read_data - read pre-requested row-data from EC + * @args Input register arguments of pre-requested rows + * @data Output register values + * + * Reads current row data from the controller, assuming it's already + * requested. Follows the H8S spec for register access and status checks. + */ +static int thinkpad_ec_read_data(const struct thinkpad_ec_row *args, + struct thinkpad_ec_row *data) +{ + int i; + u8 str3 = inb(TPC_STR3_PORT) & H8S_STR3_MASK; + /* Once we make a request, STR3 assumes the sequence of values listed + * in the following 'if' as it reads the request and writes its data. + * It takes about a few dozen nanosecs total, with very high variance. + */ + if (str3 == (H8S_STR3_IBF3B|H8S_STR3_MWMF) || + str3 == 0x00 || /* the 0x00 is indistinguishable from idle EC! */ + str3 == H8S_STR3_SWMF) + return -EBUSY; /* not ready yet */ + /* Finally, the EC signals output buffer full: */ + if (str3 != (H8S_STR3_OBF3B|H8S_STR3_SWMF)) { + printk(KERN_WARNING + REQ_FMT("bad initial STR3", str3)); + return -EIO; + } + + /* Read first byte (signals start of read transactions): */ + data->val[0] = inb(TPC_TWR0_PORT); + /* Optionally read 14 more bytes: */ + for (i = 1; i < TP_CONTROLLER_ROW_LEN-1; i++) + if ((data->mask >> i)&1) + data->val[i] = inb(TPC_TWR0_PORT+i); + /* Read last byte from 0x161F (signals end of read transaction): */ + data->val[0xF] = inb(TPC_TWR15_PORT); + + /* Readout still pending? */ + str3 = inb(TPC_STR3_PORT) & H8S_STR3_MASK; + if (str3 & H8S_STR3_OBF3B) + printk(KERN_WARNING + REQ_FMT("OBF3B=1 after read", str3)); + /* If port 0x161F returns 0x80 too often, the EC may lock up. Warn: */ + if (data->val[0xF] == 0x80) + printk(KERN_WARNING + REQ_FMT("0x161F reports error", data->val[0xF])); + return 0; +} + +/** + * thinkpad_ec_is_row_fetched - is the given row currently prefetched? + * + * To keep things simple we compare only the first and last args; + * this suffices for all known cases. + */ +static int thinkpad_ec_is_row_fetched(const struct thinkpad_ec_row *args) +{ + return (prefetch_jiffies != TPC_PREFETCH_NONE) && + (prefetch_jiffies != TPC_PREFETCH_JUNK) && + (prefetch_arg0 == args->val[0]) && + (prefetch_argF == args->val[0xF]) && + (get_jiffies_64() < prefetch_jiffies + TPC_PREFETCH_TIMEOUT); +} + +/** + * thinkpad_ec_read_row - request and read data from ThinkPad EC + * @args Input register arguments + * @data Output register values + * + * Read a data row from the ThinkPad embedded controller LPC3 interface. + * Does fetching and retrying if needed. The row is specified by an + * array of 16 bytes, some of which may be undefined (but the first is + * mandatory). These bytes are given in @args->val[], where @args->val[i] is + * used iff (@args->mask>>i)&1). The resulting row data is stored in + * @data->val[], but is only guaranteed to be valid for indices corresponding + * to set bit in @data->mask. That is, if @data->mask&(1<val[i] is undefined. + * + * Returns -EBUSY on transient error and -EIO on abnormal condition. + * Caller must hold controller lock. + */ +int thinkpad_ec_read_row(const struct thinkpad_ec_row *args, + struct thinkpad_ec_row *data) +{ + int retries, ret; + + if (thinkpad_ec_is_row_fetched(args)) + goto read_row; /* already requested */ + + /* Request the row */ + for (retries = 0; retries < TPC_READ_RETRIES; ++retries) { + ret = thinkpad_ec_request_row(args); + if (!ret) + goto read_row; + if (ret != -EBUSY) + break; + ndelay(TPC_READ_NDELAY); + } + printk(KERN_ERR REQ_FMT("failed requesting row", ret)); + goto out; + +read_row: + /* Read the row's data */ + for (retries = 0; retries < TPC_READ_RETRIES; ++retries) { + ret = thinkpad_ec_read_data(args, data); + if (!ret) + goto out; + if (ret != -EBUSY) + break; + ndelay(TPC_READ_NDELAY); + } + + printk(KERN_ERR REQ_FMT("failed waiting for data", ret)); + +out: + prefetch_jiffies = TPC_PREFETCH_JUNK; + return ret; +} +EXPORT_SYMBOL_GPL(thinkpad_ec_read_row); + +/** + * thinkpad_ec_try_read_row - try reading prefetched data from ThinkPad EC + * @args Input register arguments + * @data Output register values + * + * Try reading a data row from the ThinkPad embedded controller LPC3 + * interface, if this raw was recently prefetched using + * thinkpad_ec_prefetch_row(). Does not fetch, retry or block. + * The parameters have the same meaning as in thinkpad_ec_read_row(). + * + * Returns -EBUSY is data not ready and -ENODATA if row not prefetched. + * Caller must hold controller lock. + */ +int thinkpad_ec_try_read_row(const struct thinkpad_ec_row *args, + struct thinkpad_ec_row *data) +{ + int ret; + if (!thinkpad_ec_is_row_fetched(args)) { + ret = -ENODATA; + } else { + ret = thinkpad_ec_read_data(args, data); + if (!ret) + prefetch_jiffies = TPC_PREFETCH_NONE; /* eaten up */ + } + return ret; +} +EXPORT_SYMBOL_GPL(thinkpad_ec_try_read_row); + +/** + * thinkpad_ec_prefetch_row - prefetch data from ThinkPad EC + * @args Input register arguments + * + * Prefetch a data row from the ThinkPad embedded controller LCP3 + * interface. A subsequent call to thinkpad_ec_read_row() with the + * same arguments will be faster, and a subsequent call to + * thinkpad_ec_try_read_row() stands a good chance of succeeding if + * done neither too soon nor too late. See + * thinkpad_ec_read_row() for the meaning of @args. + * + * Returns -EBUSY on transient error and -EIO on abnormal condition. + * Caller must hold controller lock. + */ +int thinkpad_ec_prefetch_row(const struct thinkpad_ec_row *args) +{ + int ret; + ret = thinkpad_ec_request_row(args); + if (ret) { + prefetch_jiffies = TPC_PREFETCH_JUNK; + } else { + prefetch_jiffies = get_jiffies_64(); + prefetch_arg0 = args->val[0x0]; + prefetch_argF = args->val[0xF]; + } + return ret; +} +EXPORT_SYMBOL_GPL(thinkpad_ec_prefetch_row); + +/** + * thinkpad_ec_invalidate - invalidate prefetched ThinkPad EC data + * + * Invalidate the data prefetched via thinkpad_ec_prefetch_row() from the + * ThinkPad embedded controller LPC3 interface. + * Must be called before unlocking by any code that accesses the controller + * ports directly. + */ +void thinkpad_ec_invalidate(void) +{ + prefetch_jiffies = TPC_PREFETCH_JUNK; +} +EXPORT_SYMBOL_GPL(thinkpad_ec_invalidate); + + +/*** Checking for EC hardware ***/ + +/** + * thinkpad_ec_test - verify the EC is present and follows protocol + * + * Ensure the EC LPC3 channel really works on this machine by making + * an EC request and seeing if the EC follows the documented H8S protocol. + * The requested row just reads battery status, so it should be harmless to + * access it (on a correct EC). + * This test writes to IO ports, so execute only after checking DMI. + */ +static int __init thinkpad_ec_test(void) +{ + int ret; + const struct thinkpad_ec_row args = /* battery 0 basic status */ + { .mask = 0x8001, .val = {0x01,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0x00} }; + struct thinkpad_ec_row data = { .mask = 0x0000 }; + ret = thinkpad_ec_lock(); + if (ret) + return ret; + ret = thinkpad_ec_read_row(&args, &data); + thinkpad_ec_unlock(); + return ret; +} + +/* Search all DMI device names of a given type for a substring */ +static int __init dmi_find_substring(int type, const char *substr) +{ + const struct dmi_device *dev = NULL; + while ((dev = dmi_find_device(type, NULL, dev))) { + if (strstr(dev->name, substr)) + return 1; + } + return 0; +} + +#define TP_DMI_MATCH(vendor,model) { \ + .ident = vendor " " model, \ + .matches = { \ + DMI_MATCH(DMI_BOARD_VENDOR, vendor), \ + DMI_MATCH(DMI_PRODUCT_VERSION, model) \ + } \ +} + +/* Check DMI for existence of ThinkPad embedded controller */ +static int __init check_dmi_for_ec(void) +{ + /* A few old models that have a good EC but don't report it in DMI */ + struct dmi_system_id tp_whitelist[] = { + TP_DMI_MATCH("IBM", "ThinkPad A30"), + TP_DMI_MATCH("IBM", "ThinkPad T23"), + TP_DMI_MATCH("IBM", "ThinkPad X24"), + TP_DMI_MATCH("LENOVO", "ThinkPad"), + { .ident = NULL } + }; + return dmi_find_substring(DMI_DEV_TYPE_OEM_STRING, + "IBM ThinkPad Embedded Controller") || + dmi_check_system(tp_whitelist); +} + +/*** Init and cleanup ***/ + +static int __init thinkpad_ec_init(void) +{ + if (!check_dmi_for_ec()) { + printk(KERN_WARNING + "thinkpad_ec: no ThinkPad embedded controller!\n"); + return -ENODEV; + } + + if (request_region(TPC_BASE_PORT, TPC_NUM_PORTS, "thinkpad_ec")) { + reserved_io = 1; + } else { + printk(KERN_ERR "thinkpad_ec: cannot claim IO ports %#x-%#x... ", + TPC_BASE_PORT, + TPC_BASE_PORT + TPC_NUM_PORTS - 1); + if (force_io) { + printk("forcing use of unreserved IO ports.\n"); + } else { + printk("consider using force_io=1.\n"); + return -ENXIO; + } + } + prefetch_jiffies = TPC_PREFETCH_JUNK; + if (thinkpad_ec_test()) { + printk(KERN_ERR "thinkpad_ec: initial ec test failed\n"); + if (reserved_io) + release_region(TPC_BASE_PORT, TPC_NUM_PORTS); + return -ENXIO; + } + printk(KERN_INFO "thinkpad_ec: thinkpad_ec " TP_VERSION " loaded.\n"); + return 0; +} + +static void __exit thinkpad_ec_exit(void) +{ + if (reserved_io) + release_region(TPC_BASE_PORT, TPC_NUM_PORTS); + printk(KERN_INFO "thinkpad_ec: unloaded.\n"); +} + +module_init(thinkpad_ec_init); +module_exit(thinkpad_ec_exit); diff --git b/drivers/platform/x86/tp_smapi.c b/drivers/platform/x86/tp_smapi.c new file mode 100644 index 0000000..747c3d7 --- /dev/null +++ b/drivers/platform/x86/tp_smapi.c @@ -0,0 +1,1493 @@ +/* + * tp_smapi.c - ThinkPad SMAPI support + * + * This driver exposes some features of the System Management Application + * Program Interface (SMAPI) BIOS found on ThinkPad laptops. It works on + * models in which the SMAPI BIOS runs in SMM and is invoked by writing + * to the APM control port 0xB2. + * It also exposes battery status information, obtained from the ThinkPad + * embedded controller (via the thinkpad_ec module). + * Ancient ThinkPad models use a different interface, supported by the + * "thinkpad" module from "tpctl". + * + * Many of the battery status values obtained from the EC simply mirror + * values provided by the battery's Smart Battery System (SBS) interface, so + * their meaning is defined by the Smart Battery Data Specification (see + * http://sbs-forum.org/specs/sbdat110.pdf). References to this SBS spec + * are given in the code where relevant. + * + * Copyright (C) 2006 Shem Multinymous . + * SMAPI access code based on the mwave driver by Mike Sullivan. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include +#include +#include +#include +#include /* CMOS defines */ +#include +#include +#include +#include +#include +#include + +#define TP_VERSION "0.41" +#define TP_DESC "ThinkPad SMAPI Support" +#define TP_DIR "smapi" + +MODULE_AUTHOR("Shem Multinymous"); +MODULE_DESCRIPTION(TP_DESC); +MODULE_VERSION(TP_VERSION); +MODULE_LICENSE("GPL"); + +static struct platform_device *pdev; + +static int tp_debug; +module_param_named(debug, tp_debug, int, 0600); +MODULE_PARM_DESC(debug, "Debug level (0=off, 1=on)"); + +/* A few macros for printk()ing: */ +#define TPRINTK(level, fmt, args...) \ + dev_printk(level, &(pdev->dev), "%s: " fmt "\n", __func__, ## args) +#define DPRINTK(fmt, args...) \ + do { if (tp_debug) TPRINTK(KERN_DEBUG, fmt, ## args); } while (0) + +/********************************************************************* + * SMAPI interface + */ + +/* SMAPI functions (register BX when making the SMM call). */ +#define SMAPI_GET_INHIBIT_CHARGE 0x2114 +#define SMAPI_SET_INHIBIT_CHARGE 0x2115 +#define SMAPI_GET_THRESH_START 0x2116 +#define SMAPI_SET_THRESH_START 0x2117 +#define SMAPI_GET_FORCE_DISCHARGE 0x2118 +#define SMAPI_SET_FORCE_DISCHARGE 0x2119 +#define SMAPI_GET_THRESH_STOP 0x211a +#define SMAPI_SET_THRESH_STOP 0x211b + +/* SMAPI error codes (see ThinkPad 770 Technical Reference Manual p.83 at + http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=PFAN-3TUQQD */ +#define SMAPI_RETCODE_EOF 0xff +static struct { u8 rc; char *msg; int ret; } smapi_retcode[] = +{ + {0x00, "OK", 0}, + {0x53, "SMAPI function is not available", -ENXIO}, + {0x81, "Invalid parameter", -EINVAL}, + {0x86, "Function is not supported by SMAPI BIOS", -EOPNOTSUPP}, + {0x90, "System error", -EIO}, + {0x91, "System is invalid", -EIO}, + {0x92, "System is busy, -EBUSY"}, + {0xa0, "Device error (disk read error)", -EIO}, + {0xa1, "Device is busy", -EBUSY}, + {0xa2, "Device is not attached", -ENXIO}, + {0xa3, "Device is disbled", -EIO}, + {0xa4, "Request parameter is out of range", -EINVAL}, + {0xa5, "Request parameter is not accepted", -EINVAL}, + {0xa6, "Transient error", -EBUSY}, /* ? */ + {SMAPI_RETCODE_EOF, "Unknown error code", -EIO} +}; + + +#define SMAPI_MAX_RETRIES 10 +#define SMAPI_PORT2 0x4F /* fixed port, meaning unclear */ +static unsigned short smapi_port; /* APM control port, normally 0xB2 */ + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,37) +static DECLARE_MUTEX(smapi_mutex); +#else +static DEFINE_SEMAPHORE(smapi_mutex); +#endif + +/** + * find_smapi_port - read SMAPI port from NVRAM + */ +static int __init find_smapi_port(void) +{ + u16 smapi_id = 0; + unsigned short port = 0; + unsigned long flags; + + spin_lock_irqsave(&rtc_lock, flags); + smapi_id = CMOS_READ(0x7C); + smapi_id |= (CMOS_READ(0x7D) << 8); + spin_unlock_irqrestore(&rtc_lock, flags); + + if (smapi_id != 0x5349) { + printk(KERN_ERR "SMAPI not supported (ID=0x%x)\n", smapi_id); + return -ENXIO; + } + spin_lock_irqsave(&rtc_lock, flags); + port = CMOS_READ(0x7E); + port |= (CMOS_READ(0x7F) << 8); + spin_unlock_irqrestore(&rtc_lock, flags); + if (port == 0) { + printk(KERN_ERR "unable to read SMAPI port number\n"); + return -ENXIO; + } + return port; +} + +/** + * smapi_request - make a SMAPI call + * @inEBX, @inECX, @inEDI, @inESI: input registers + * @outEBX, @outECX, @outEDX, @outEDI, @outESI: outputs registers + * @msg: textual error message + * Invokes the SMAPI SMBIOS with the given input and outpu args. + * All outputs are optional (can be %NULL). + * Returns 0 when successful, and a negative errno constant + * (see smapi_retcode above) upon failure. + */ +static int smapi_request(u32 inEBX, u32 inECX, + u32 inEDI, u32 inESI, + u32 *outEBX, u32 *outECX, u32 *outEDX, + u32 *outEDI, u32 *outESI, const char **msg) +{ + int ret = 0; + int i; + int retries; + u8 rc; + /* Must use local vars for output regs, due to reg pressure. */ + u32 tmpEAX, tmpEBX, tmpECX, tmpEDX, tmpEDI, tmpESI; + + for (retries = 0; retries < SMAPI_MAX_RETRIES; ++retries) { + DPRINTK("req_in: BX=%x CX=%x DI=%x SI=%x", + inEBX, inECX, inEDI, inESI); + + /* SMAPI's SMBIOS call and thinkpad_ec end up using use + * different interfaces to the same chip, so play it safe. */ + ret = thinkpad_ec_lock(); + if (ret) + return ret; + + __asm__ __volatile__( + "movl $0x00005380,%%eax\n\t" + "movl %6,%%ebx\n\t" + "movl %7,%%ecx\n\t" + "movl %8,%%edi\n\t" + "movl %9,%%esi\n\t" + "xorl %%edx,%%edx\n\t" + "movw %10,%%dx\n\t" + "out %%al,%%dx\n\t" /* trigger SMI to SMBIOS */ + "out %%al,$0x4F\n\t" + "movl %%eax,%0\n\t" + "movl %%ebx,%1\n\t" + "movl %%ecx,%2\n\t" + "movl %%edx,%3\n\t" + "movl %%edi,%4\n\t" + "movl %%esi,%5\n\t" + :"=m"(tmpEAX), + "=m"(tmpEBX), + "=m"(tmpECX), + "=m"(tmpEDX), + "=m"(tmpEDI), + "=m"(tmpESI) + :"m"(inEBX), "m"(inECX), "m"(inEDI), "m"(inESI), + "m"((u16)smapi_port) + :"%eax", "%ebx", "%ecx", "%edx", "%edi", + "%esi"); + + thinkpad_ec_invalidate(); + thinkpad_ec_unlock(); + + /* Don't let the next SMAPI access happen too quickly, + * may case problems. (We're hold smapi_mutex). */ + msleep(50); + + if (outEBX) *outEBX = tmpEBX; + if (outECX) *outECX = tmpECX; + if (outEDX) *outEDX = tmpEDX; + if (outESI) *outESI = tmpESI; + if (outEDI) *outEDI = tmpEDI; + + /* Look up error code */ + rc = (tmpEAX>>8)&0xFF; + for (i = 0; smapi_retcode[i].rc != SMAPI_RETCODE_EOF && + smapi_retcode[i].rc != rc; ++i) {} + ret = smapi_retcode[i].ret; + if (msg) + *msg = smapi_retcode[i].msg; + + DPRINTK("req_out: AX=%x BX=%x CX=%x DX=%x DI=%x SI=%x r=%d", + tmpEAX, tmpEBX, tmpECX, tmpEDX, tmpEDI, tmpESI, ret); + if (ret) + TPRINTK(KERN_NOTICE, "SMAPI error: %s (func=%x)", + smapi_retcode[i].msg, inEBX); + + if (ret != -EBUSY) + return ret; + } + return ret; +} + +/* Convenience wrapper: discard output arguments */ +static int smapi_write(u32 inEBX, u32 inECX, + u32 inEDI, u32 inESI, const char **msg) +{ + return smapi_request(inEBX, inECX, inEDI, inESI, + NULL, NULL, NULL, NULL, NULL, msg); +} + + +/********************************************************************* + * Specific SMAPI services + * All of these functions return 0 upon success, and a negative errno + * constant (see smapi_retcode) on failure. + */ + +enum thresh_type { + THRESH_STOP = 0, /* the code assumes this is 0 for brevity */ + THRESH_START +}; +#define THRESH_NAME(which) ((which == THRESH_START) ? "start" : "stop") + +/** + * __get_real_thresh - read battery charge start/stop threshold from SMAPI + * @bat: battery number (0 or 1) + * @which: THRESH_START or THRESH_STOP + * @thresh: 1..99, 0=default 1..99, 0=default (pass this as-is to SMAPI) + * @outEDI: some additional state that needs to be preserved, meaning unknown + * @outESI: some additional state that needs to be preserved, meaning unknown + */ +static int __get_real_thresh(int bat, enum thresh_type which, int *thresh, + u32 *outEDI, u32 *outESI) +{ + u32 ebx = (which == THRESH_START) ? SMAPI_GET_THRESH_START + : SMAPI_GET_THRESH_STOP; + u32 ecx = (bat+1)<<8; + const char *msg; + int ret = smapi_request(ebx, ecx, 0, 0, NULL, + &ecx, NULL, outEDI, outESI, &msg); + if (ret) { + TPRINTK(KERN_NOTICE, "cannot get %s_thresh of bat=%d: %s", + THRESH_NAME(which), bat, msg); + return ret; + } + if (!(ecx&0x00000100)) { + TPRINTK(KERN_NOTICE, "cannot get %s_thresh of bat=%d: ecx=0%x", + THRESH_NAME(which), bat, ecx); + return -EIO; + } + if (thresh) + *thresh = ecx&0xFF; + return 0; +} + +/** + * get_real_thresh - read battery charge start/stop threshold from SMAPI + * @bat: battery number (0 or 1) + * @which: THRESH_START or THRESH_STOP + * @thresh: 1..99, 0=default (passes as-is to SMAPI) + */ +static int get_real_thresh(int bat, enum thresh_type which, int *thresh) +{ + return __get_real_thresh(bat, which, thresh, NULL, NULL); +} + +/** + * set_real_thresh - write battery start/top charge threshold to SMAPI + * @bat: battery number (0 or 1) + * @which: THRESH_START or THRESH_STOP + * @thresh: 1..99, 0=default (passes as-is to SMAPI) + */ +static int set_real_thresh(int bat, enum thresh_type which, int thresh) +{ + u32 ebx = (which == THRESH_START) ? SMAPI_SET_THRESH_START + : SMAPI_SET_THRESH_STOP; + u32 ecx = ((bat+1)<<8) + thresh; + u32 getDI, getSI; + const char *msg; + int ret; + + /* verify read before writing */ + ret = __get_real_thresh(bat, which, NULL, &getDI, &getSI); + if (ret) + return ret; + + ret = smapi_write(ebx, ecx, getDI, getSI, &msg); + if (ret) + TPRINTK(KERN_NOTICE, "set %s to %d for bat=%d failed: %s", + THRESH_NAME(which), thresh, bat, msg); + else + TPRINTK(KERN_INFO, "set %s to %d for bat=%d", + THRESH_NAME(which), thresh, bat); + return ret; +} + +/** + * __get_inhibit_charge_minutes - get inhibit charge period from SMAPI + * @bat: battery number (0 or 1) + * @minutes: period in minutes (1..65535 minutes, 0=disabled) + * @outECX: some additional state that needs to be preserved, meaning unknown + * Note that @minutes is the originally set value, it does not count down. + */ +static int __get_inhibit_charge_minutes(int bat, int *minutes, u32 *outECX) +{ + u32 ecx = (bat+1)<<8; + u32 esi; + const char *msg; + int ret = smapi_request(SMAPI_GET_INHIBIT_CHARGE, ecx, 0, 0, + NULL, &ecx, NULL, NULL, &esi, &msg); + if (ret) { + TPRINTK(KERN_NOTICE, "failed for bat=%d: %s", bat, msg); + return ret; + } + if (!(ecx&0x0100)) { + TPRINTK(KERN_NOTICE, "bad ecx=0x%x for bat=%d", ecx, bat); + return -EIO; + } + if (minutes) + *minutes = (ecx&0x0001)?esi:0; + if (outECX) + *outECX = ecx; + return 0; +} + +/** + * get_inhibit_charge_minutes - get inhibit charge period from SMAPI + * @bat: battery number (0 or 1) + * @minutes: period in minutes (1..65535 minutes, 0=disabled) + * Note that @minutes is the originally set value, it does not count down. + */ +static int get_inhibit_charge_minutes(int bat, int *minutes) +{ + return __get_inhibit_charge_minutes(bat, minutes, NULL); +} + +/** + * set_inhibit_charge_minutes - write inhibit charge period to SMAPI + * @bat: battery number (0 or 1) + * @minutes: period in minutes (1..65535 minutes, 0=disabled) + */ +static int set_inhibit_charge_minutes(int bat, int minutes) +{ + u32 ecx; + const char *msg; + int ret; + + /* verify read before writing */ + ret = __get_inhibit_charge_minutes(bat, NULL, &ecx); + if (ret) + return ret; + + ecx = ((bat+1)<<8) | (ecx&0x00FE) | (minutes > 0 ? 0x0001 : 0x0000); + if (minutes > 0xFFFF) + minutes = 0xFFFF; + ret = smapi_write(SMAPI_SET_INHIBIT_CHARGE, ecx, 0, minutes, &msg); + if (ret) + TPRINTK(KERN_NOTICE, + "set to %d failed for bat=%d: %s", minutes, bat, msg); + else + TPRINTK(KERN_INFO, "set to %d for bat=%d\n", minutes, bat); + return ret; +} + + +/** + * get_force_discharge - get status of forced discharging from SMAPI + * @bat: battery number (0 or 1) + * @enabled: 1 if forced discharged is enabled, 0 if not + */ +static int get_force_discharge(int bat, int *enabled) +{ + u32 ecx = (bat+1)<<8; + const char *msg; + int ret = smapi_request(SMAPI_GET_FORCE_DISCHARGE, ecx, 0, 0, + NULL, &ecx, NULL, NULL, NULL, &msg); + if (ret) { + TPRINTK(KERN_NOTICE, "failed for bat=%d: %s", bat, msg); + return ret; + } + *enabled = (!(ecx&0x00000100) && (ecx&0x00000001))?1:0; + return 0; +} + +/** + * set_force_discharge - write status of forced discharging to SMAPI + * @bat: battery number (0 or 1) + * @enabled: 1 if forced discharged is enabled, 0 if not + */ +static int set_force_discharge(int bat, int enabled) +{ + u32 ecx = (bat+1)<<8; + const char *msg; + int ret = smapi_request(SMAPI_GET_FORCE_DISCHARGE, ecx, 0, 0, + NULL, &ecx, NULL, NULL, NULL, &msg); + if (ret) { + TPRINTK(KERN_NOTICE, "get failed for bat=%d: %s", bat, msg); + return ret; + } + if (ecx&0x00000100) { + TPRINTK(KERN_NOTICE, "cannot force discharge bat=%d", bat); + return -EIO; + } + + ecx = ((bat+1)<<8) | (ecx&0x000000FA) | (enabled?0x00000001:0); + ret = smapi_write(SMAPI_SET_FORCE_DISCHARGE, ecx, 0, 0, &msg); + if (ret) + TPRINTK(KERN_NOTICE, "set to %d failed for bat=%d: %s", + enabled, bat, msg); + else + TPRINTK(KERN_INFO, "set to %d for bat=%d", enabled, bat); + return ret; +} + + +/********************************************************************* + * Wrappers to threshold-related SMAPI functions, which handle default + * thresholds and related quirks. + */ + +/* Minimum, default and minimum difference for battery charging thresholds: */ +#define MIN_THRESH_DELTA 4 /* Min delta between start and stop thresh */ +#define MIN_THRESH_START 2 +#define MAX_THRESH_START (100-MIN_THRESH_DELTA) +#define MIN_THRESH_STOP (MIN_THRESH_START + MIN_THRESH_DELTA) +#define MAX_THRESH_STOP 100 +#define DEFAULT_THRESH_START MAX_THRESH_START +#define DEFAULT_THRESH_STOP MAX_THRESH_STOP + +/* The GUI of IBM's Battery Maximizer seems to show a start threshold that + * is 1 more than the value we set/get via SMAPI. Since the threshold is + * maintained across reboot, this can be confusing. So we kludge our + * interface for interoperability: */ +#define BATMAX_FIX 1 + +/* Get charge start/stop threshold (1..100), + * substituting default values if needed and applying BATMAT_FIX. */ +static int get_thresh(int bat, enum thresh_type which, int *thresh) +{ + int ret = get_real_thresh(bat, which, thresh); + if (ret) + return ret; + if (*thresh == 0) + *thresh = (which == THRESH_START) ? DEFAULT_THRESH_START + : DEFAULT_THRESH_STOP; + else if (which == THRESH_START) + *thresh += BATMAX_FIX; + return 0; +} + + +/* Set charge start/stop threshold (1..100), + * substituting default values if needed and applying BATMAT_FIX. */ +static int set_thresh(int bat, enum thresh_type which, int thresh) +{ + if (which == THRESH_STOP && thresh == DEFAULT_THRESH_STOP) + thresh = 0; /* 100 is out of range, but default means 100 */ + if (which == THRESH_START) + thresh -= BATMAX_FIX; + return set_real_thresh(bat, which, thresh); +} + +/********************************************************************* + * ThinkPad embedded controller readout and basic functions + */ + +/** + * read_tp_ec_row - read data row from the ThinkPad embedded controller + * @arg0: EC command code + * @bat: battery number, 0 or 1 + * @j: the byte value to be used for "junk" (unused) input/outputs + * @dataval: result vector + */ +static int read_tp_ec_row(u8 arg0, int bat, u8 j, u8 *dataval) +{ + int ret; + const struct thinkpad_ec_row args = { .mask = 0xFFFF, + .val = {arg0, j,j,j,j,j,j,j,j,j,j,j,j,j,j, (u8)bat} }; + struct thinkpad_ec_row data = { .mask = 0xFFFF }; + + ret = thinkpad_ec_lock(); + if (ret) + return ret; + ret = thinkpad_ec_read_row(&args, &data); + thinkpad_ec_unlock(); + memcpy(dataval, &data.val, TP_CONTROLLER_ROW_LEN); + return ret; +} + +/** + * power_device_present - check for presence of battery or AC power + * @bat: 0 for battery 0, 1 for battery 1, otherwise AC power + * Returns 1 if present, 0 if not present, negative if error. + */ +static int power_device_present(int bat) +{ + u8 row[TP_CONTROLLER_ROW_LEN]; + u8 test; + int ret = read_tp_ec_row(1, bat, 0, row); + if (ret) + return ret; + switch (bat) { + case 0: test = 0x40; break; /* battery 0 */ + case 1: test = 0x20; break; /* battery 1 */ + default: test = 0x80; /* AC power */ + } + return (row[0] & test) ? 1 : 0; +} + +/** + * bat_has_status - check if battery can report detailed status + * @bat: 0 for battery 0, 1 for battery 1 + * Returns 1 if yes, 0 if no, negative if error. + */ +static int bat_has_status(int bat) +{ + u8 row[TP_CONTROLLER_ROW_LEN]; + int ret = read_tp_ec_row(1, bat, 0, row); + if (ret) + return ret; + if ((row[0] & (bat?0x20:0x40)) == 0) /* no battery */ + return 0; + if ((row[1] & (0x60)) == 0) /* no status */ + return 0; + return 1; +} + +/** + * get_tp_ec_bat_16 - read a 16-bit value from EC battery status data + * @arg0: first argument to EC + * @off: offset in row returned from EC + * @bat: battery (0 or 1) + * @val: the 16-bit value obtained + * Returns nonzero on error. + */ +static int get_tp_ec_bat_16(u8 arg0, int offset, int bat, u16 *val) +{ + u8 row[TP_CONTROLLER_ROW_LEN]; + int ret; + if (bat_has_status(bat) != 1) + return -ENXIO; + ret = read_tp_ec_row(arg0, bat, 0, row); + if (ret) + return ret; + *val = *(u16 *)(row+offset); + return 0; +} + +/********************************************************************* + * sysfs attributes for batteries - + * definitions and helper functions + */ + +/* A custom device attribute struct which holds a battery number */ +struct bat_device_attribute { + struct device_attribute dev_attr; + int bat; +}; + +/** + * attr_get_bat - get the battery to which the attribute belongs + */ +static int attr_get_bat(struct device_attribute *attr) +{ + return container_of(attr, struct bat_device_attribute, dev_attr)->bat; +} + +/** + * show_tp_ec_bat_u16 - show an unsigned 16-bit battery attribute + * @arg0: specified 1st argument of EC raw to read + * @offset: byte offset in EC raw data + * @mul: correction factor to multiply by + * @na_msg: string to output is value not available (0xFFFFFFFF) + * @attr: battery attribute + * @buf: output buffer + * The 16-bit value is read from the EC, treated as unsigned, + * transformed as x->mul*x, and printed to the buffer. + * If the value is 0xFFFFFFFF and na_msg!=%NULL, na_msg is printed instead. + */ +static ssize_t show_tp_ec_bat_u16(u8 arg0, int offset, int mul, + const char *na_msg, + struct device_attribute *attr, char *buf) +{ + u16 val; + int ret = get_tp_ec_bat_16(arg0, offset, attr_get_bat(attr), &val); + if (ret) + return ret; + if (na_msg && val == 0xFFFF) + return sprintf(buf, "%s\n", na_msg); + else + return sprintf(buf, "%u\n", mul*(unsigned int)val); +} + +/** + * show_tp_ec_bat_s16 - show an signed 16-bit battery attribute + * @arg0: specified 1st argument of EC raw to read + * @offset: byte offset in EC raw data + * @mul: correction factor to multiply by + * @add: correction term to add after multiplication + * @attr: battery attribute + * @buf: output buffer + * The 16-bit value is read from the EC, treated as signed, + * transformed as x->mul*x+add, and printed to the buffer. + */ +static ssize_t show_tp_ec_bat_s16(u8 arg0, int offset, int mul, int add, + struct device_attribute *attr, char *buf) +{ + u16 val; + int ret = get_tp_ec_bat_16(arg0, offset, attr_get_bat(attr), &val); + if (ret) + return ret; + return sprintf(buf, "%d\n", mul*(s16)val+add); +} + +/** + * show_tp_ec_bat_str - show a string from EC battery status data + * @arg0: specified 1st argument of EC raw to read + * @offset: byte offset in EC raw data + * @maxlen: maximum string length + * @attr: battery attribute + * @buf: output buffer + */ +static ssize_t show_tp_ec_bat_str(u8 arg0, int offset, int maxlen, + struct device_attribute *attr, char *buf) +{ + int bat = attr_get_bat(attr); + u8 row[TP_CONTROLLER_ROW_LEN]; + int ret; + if (bat_has_status(bat) != 1) + return -ENXIO; + ret = read_tp_ec_row(arg0, bat, 0, row); + if (ret) + return ret; + strncpy(buf, (char *)row+offset, maxlen); + buf[maxlen] = 0; + strcat(buf, "\n"); + return strlen(buf); +} + +/** + * show_tp_ec_bat_power - show a power readout from EC battery status data + * @arg0: specified 1st argument of EC raw to read + * @offV: byte offset of voltage in EC raw data + * @offI: byte offset of current in EC raw data + * @attr: battery attribute + * @buf: output buffer + * Computes the power as current*voltage from the two given readout offsets. + */ +static ssize_t show_tp_ec_bat_power(u8 arg0, int offV, int offI, + struct device_attribute *attr, char *buf) +{ + u8 row[TP_CONTROLLER_ROW_LEN]; + int milliamp, millivolt, ret; + int bat = attr_get_bat(attr); + if (bat_has_status(bat) != 1) + return -ENXIO; + ret = read_tp_ec_row(1, bat, 0, row); + if (ret) + return ret; + millivolt = *(u16 *)(row+offV); + milliamp = *(s16 *)(row+offI); + return sprintf(buf, "%d\n", milliamp*millivolt/1000); /* units: mW */ +} + +/** + * show_tp_ec_bat_date - decode and show a date from EC battery status data + * @arg0: specified 1st argument of EC raw to read + * @offset: byte offset in EC raw data + * @attr: battery attribute + * @buf: output buffer + */ +static ssize_t show_tp_ec_bat_date(u8 arg0, int offset, + struct device_attribute *attr, char *buf) +{ + u8 row[TP_CONTROLLER_ROW_LEN]; + u16 v; + int ret; + int day, month, year; + int bat = attr_get_bat(attr); + if (bat_has_status(bat) != 1) + return -ENXIO; + ret = read_tp_ec_row(arg0, bat, 0, row); + if (ret) + return ret; + + /* Decode bit-packed: v = day | (month<<5) | ((year-1980)<<9) */ + v = *(u16 *)(row+offset); + day = v & 0x1F; + month = (v >> 5) & 0xF; + year = (v >> 9) + 1980; + + return sprintf(buf, "%04d-%02d-%02d\n", year, month, day); +} + + +/********************************************************************* + * sysfs attribute I/O for batteries - + * the actual attribute show/store functions + */ + +static ssize_t show_battery_start_charge_thresh(struct device *dev, + struct device_attribute *attr, char *buf) +{ + int thresh; + int bat = attr_get_bat(attr); + int ret = get_thresh(bat, THRESH_START, &thresh); + if (ret) + return ret; + return sprintf(buf, "%d\n", thresh); /* units: percent */ +} + +static ssize_t show_battery_stop_charge_thresh(struct device *dev, + struct device_attribute *attr, char *buf) +{ + int thresh; + int bat = attr_get_bat(attr); + int ret = get_thresh(bat, THRESH_STOP, &thresh); + if (ret) + return ret; + return sprintf(buf, "%d\n", thresh); /* units: percent */ +} + +/** + * store_battery_start_charge_thresh - store battery_start_charge_thresh attr + * Since this is a kernel<->user interface, we ensure a valid state for + * the hardware. We do this by clamping the requested threshold to the + * valid range and, if necessary, moving the other threshold so that + * it's MIN_THRESH_DELTA away from this one. + */ +static ssize_t store_battery_start_charge_thresh(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + int thresh, other_thresh, ret; + int bat = attr_get_bat(attr); + + if (sscanf(buf, "%d", &thresh) != 1 || thresh < 1 || thresh > 100) + return -EINVAL; + + if (thresh < MIN_THRESH_START) /* clamp up to MIN_THRESH_START */ + thresh = MIN_THRESH_START; + if (thresh > MAX_THRESH_START) /* clamp down to MAX_THRESH_START */ + thresh = MAX_THRESH_START; + + down(&smapi_mutex); + ret = get_thresh(bat, THRESH_STOP, &other_thresh); + if (ret != -EOPNOTSUPP && ret != -ENXIO) { + if (ret) /* other threshold is set? */ + goto out; + ret = get_real_thresh(bat, THRESH_START, NULL); + if (ret) /* this threshold is set? */ + goto out; + if (other_thresh < thresh+MIN_THRESH_DELTA) { + /* move other thresh to keep it above this one */ + ret = set_thresh(bat, THRESH_STOP, + thresh+MIN_THRESH_DELTA); + if (ret) + goto out; + } + } + ret = set_thresh(bat, THRESH_START, thresh); +out: + up(&smapi_mutex); + return count; + +} + +/** + * store_battery_stop_charge_thresh - store battery_stop_charge_thresh attr + * Since this is a kernel<->user interface, we ensure a valid state for + * the hardware. We do this by clamping the requested threshold to the + * valid range and, if necessary, moving the other threshold so that + * it's MIN_THRESH_DELTA away from this one. + */ +static ssize_t store_battery_stop_charge_thresh(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + int thresh, other_thresh, ret; + int bat = attr_get_bat(attr); + + if (sscanf(buf, "%d", &thresh) != 1 || thresh < 1 || thresh > 100) + return -EINVAL; + + if (thresh < MIN_THRESH_STOP) /* clamp up to MIN_THRESH_STOP */ + thresh = MIN_THRESH_STOP; + + down(&smapi_mutex); + ret = get_thresh(bat, THRESH_START, &other_thresh); + if (ret != -EOPNOTSUPP && ret != -ENXIO) { /* other threshold exists? */ + if (ret) + goto out; + /* this threshold exists? */ + ret = get_real_thresh(bat, THRESH_STOP, NULL); + if (ret) + goto out; + if (other_thresh >= thresh-MIN_THRESH_DELTA) { + /* move other thresh to be below this one */ + ret = set_thresh(bat, THRESH_START, + thresh-MIN_THRESH_DELTA); + if (ret) + goto out; + } + } + ret = set_thresh(bat, THRESH_STOP, thresh); +out: + up(&smapi_mutex); + return count; +} + +static ssize_t show_battery_inhibit_charge_minutes(struct device *dev, + struct device_attribute *attr, char *buf) +{ + int minutes; + int bat = attr_get_bat(attr); + int ret = get_inhibit_charge_minutes(bat, &minutes); + if (ret) + return ret; + return sprintf(buf, "%d\n", minutes); /* units: minutes */ +} + +static ssize_t store_battery_inhibit_charge_minutes(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + int ret; + int minutes; + int bat = attr_get_bat(attr); + if (sscanf(buf, "%d", &minutes) != 1 || minutes < 0) { + TPRINTK(KERN_ERR, "inhibit_charge_minutes: " + "must be a non-negative integer"); + return -EINVAL; + } + ret = set_inhibit_charge_minutes(bat, minutes); + if (ret) + return ret; + return count; +} + +static ssize_t show_battery_force_discharge(struct device *dev, + struct device_attribute *attr, char *buf) +{ + int enabled; + int bat = attr_get_bat(attr); + int ret = get_force_discharge(bat, &enabled); + if (ret) + return ret; + return sprintf(buf, "%d\n", enabled); /* type: boolean */ +} + +static ssize_t store_battery_force_discharge(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + int ret; + int enabled; + int bat = attr_get_bat(attr); + if (sscanf(buf, "%d", &enabled) != 1 || enabled < 0 || enabled > 1) + return -EINVAL; + ret = set_force_discharge(bat, enabled); + if (ret) + return ret; + return count; +} + +static ssize_t show_battery_installed( + struct device *dev, struct device_attribute *attr, char *buf) +{ + int bat = attr_get_bat(attr); + int ret = power_device_present(bat); + if (ret < 0) + return ret; + return sprintf(buf, "%d\n", ret); /* type: boolean */ +} + +static ssize_t show_battery_state( + struct device *dev, struct device_attribute *attr, char *buf) +{ + u8 row[TP_CONTROLLER_ROW_LEN]; + const char *txt; + int ret; + int bat = attr_get_bat(attr); + if (bat_has_status(bat) != 1) + return sprintf(buf, "none\n"); + ret = read_tp_ec_row(1, bat, 0, row); + if (ret) + return ret; + switch (row[1] & 0xf0) { + case 0xc0: txt = "idle"; break; + case 0xd0: txt = "discharging"; break; + case 0xe0: txt = "charging"; break; + default: return sprintf(buf, "unknown (0x%x)\n", row[1]); + } + return sprintf(buf, "%s\n", txt); /* type: string from fixed set */ +} + +static ssize_t show_battery_manufacturer( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* type: string. SBS spec v1.1 p34: ManufacturerName() */ + return show_tp_ec_bat_str(4, 2, TP_CONTROLLER_ROW_LEN-2, attr, buf); +} + +static ssize_t show_battery_model( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* type: string. SBS spec v1.1 p34: DeviceName() */ + return show_tp_ec_bat_str(5, 2, TP_CONTROLLER_ROW_LEN-2, attr, buf); +} + +static ssize_t show_battery_barcoding( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* type: string */ + return show_tp_ec_bat_str(7, 2, TP_CONTROLLER_ROW_LEN-2, attr, buf); +} + +static ssize_t show_battery_chemistry( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* type: string. SBS spec v1.1 p34-35: DeviceChemistry() */ + return show_tp_ec_bat_str(6, 2, 5, attr, buf); +} + +static ssize_t show_battery_voltage( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mV. SBS spec v1.1 p24: Voltage() */ + return show_tp_ec_bat_u16(1, 6, 1, NULL, attr, buf); +} + +static ssize_t show_battery_design_voltage( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mV. SBS spec v1.1 p32: DesignVoltage() */ + return show_tp_ec_bat_u16(3, 4, 1, NULL, attr, buf); +} + +static ssize_t show_battery_charging_max_voltage( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mV. SBS spec v1.1 p37,39: ChargingVoltage() */ + return show_tp_ec_bat_u16(9, 8, 1, NULL, attr, buf); +} + +static ssize_t show_battery_group0_voltage( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mV */ + return show_tp_ec_bat_u16(0xA, 12, 1, NULL, attr, buf); +} + +static ssize_t show_battery_group1_voltage( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mV */ + return show_tp_ec_bat_u16(0xA, 10, 1, NULL, attr, buf); +} + +static ssize_t show_battery_group2_voltage( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mV */ + return show_tp_ec_bat_u16(0xA, 8, 1, NULL, attr, buf); +} + +static ssize_t show_battery_group3_voltage( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mV */ + return show_tp_ec_bat_u16(0xA, 6, 1, NULL, attr, buf); +} + +static ssize_t show_battery_current_now( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mA. SBS spec v1.1 p24: Current() */ + return show_tp_ec_bat_s16(1, 8, 1, 0, attr, buf); +} + +static ssize_t show_battery_current_avg( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mA. SBS spec v1.1 p24: AverageCurrent() */ + return show_tp_ec_bat_s16(1, 10, 1, 0, attr, buf); +} + +static ssize_t show_battery_charging_max_current( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mA. SBS spec v1.1 p36,38: ChargingCurrent() */ + return show_tp_ec_bat_s16(9, 6, 1, 0, attr, buf); +} + +static ssize_t show_battery_power_now( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mW. SBS spec v1.1: Voltage()*Current() */ + return show_tp_ec_bat_power(1, 6, 8, attr, buf); +} + +static ssize_t show_battery_power_avg( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mW. SBS spec v1.1: Voltage()*AverageCurrent() */ + return show_tp_ec_bat_power(1, 6, 10, attr, buf); +} + +static ssize_t show_battery_remaining_percent( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: percent. SBS spec v1.1 p25: RelativeStateOfCharge() */ + return show_tp_ec_bat_u16(1, 12, 1, NULL, attr, buf); +} + +static ssize_t show_battery_remaining_percent_error( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: percent. SBS spec v1.1 p25: MaxError() */ + return show_tp_ec_bat_u16(9, 4, 1, NULL, attr, buf); +} + +static ssize_t show_battery_remaining_charging_time( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: minutes. SBS spec v1.1 p27: AverageTimeToFull() */ + return show_tp_ec_bat_u16(2, 8, 1, "not_charging", attr, buf); +} + +static ssize_t show_battery_remaining_running_time( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: minutes. SBS spec v1.1 p27: RunTimeToEmpty() */ + return show_tp_ec_bat_u16(2, 6, 1, "not_discharging", attr, buf); +} + +static ssize_t show_battery_remaining_running_time_now( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: minutes. SBS spec v1.1 p27: RunTimeToEmpty() */ + return show_tp_ec_bat_u16(2, 4, 1, "not_discharging", attr, buf); +} + +static ssize_t show_battery_remaining_capacity( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mWh. SBS spec v1.1 p26. */ + return show_tp_ec_bat_u16(1, 14, 10, "", attr, buf); +} + +static ssize_t show_battery_last_full_capacity( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mWh. SBS spec v1.1 p26: FullChargeCapacity() */ + return show_tp_ec_bat_u16(2, 2, 10, "", attr, buf); +} + +static ssize_t show_battery_design_capacity( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: mWh. SBS spec v1.1 p32: DesignCapacity() */ + return show_tp_ec_bat_u16(3, 2, 10, "", attr, buf); +} + +static ssize_t show_battery_cycle_count( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: ordinal. SBS spec v1.1 p32: CycleCount() */ + return show_tp_ec_bat_u16(2, 12, 1, "", attr, buf); +} + +static ssize_t show_battery_temperature( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* units: millicelsius. SBS spec v1.1: Temperature()*10 */ + return show_tp_ec_bat_s16(1, 4, 100, -273100, attr, buf); +} + +static ssize_t show_battery_serial( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* type: int. SBS spec v1.1 p34: SerialNumber() */ + return show_tp_ec_bat_u16(3, 10, 1, "", attr, buf); +} + +static ssize_t show_battery_manufacture_date( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* type: YYYY-MM-DD. SBS spec v1.1 p34: ManufactureDate() */ + return show_tp_ec_bat_date(3, 8, attr, buf); +} + +static ssize_t show_battery_first_use_date( + struct device *dev, struct device_attribute *attr, char *buf) +{ + /* type: YYYY-MM-DD */ + return show_tp_ec_bat_date(8, 2, attr, buf); +} + +/** + * show_battery_dump - show the battery's dump attribute + * The dump attribute gives a hex dump of all EC readouts related to a + * battery. Some of the enumerated values don't really exist (i.e., the + * EC function just leaves them untouched); we use a kludge to detect and + * denote these. + */ +#define MIN_DUMP_ARG0 0x00 +#define MAX_DUMP_ARG0 0x0a /* 0x0b is useful too but hangs old EC firmware */ +static ssize_t show_battery_dump( + struct device *dev, struct device_attribute *attr, char *buf) +{ + int i; + char *p = buf; + int bat = attr_get_bat(attr); + u8 arg0; /* first argument to EC */ + u8 rowa[TP_CONTROLLER_ROW_LEN], + rowb[TP_CONTROLLER_ROW_LEN]; + const u8 junka = 0xAA, + junkb = 0x55; /* junk values for testing changes */ + int ret; + + for (arg0 = MIN_DUMP_ARG0; arg0 <= MAX_DUMP_ARG0; ++arg0) { + if ((p-buf) > PAGE_SIZE-TP_CONTROLLER_ROW_LEN*5) + return -ENOMEM; /* don't overflow sysfs buf */ + /* Read raw twice with different junk values, + * to detect unused output bytes which are left unchaged: */ + ret = read_tp_ec_row(arg0, bat, junka, rowa); + if (ret) + return ret; + ret = read_tp_ec_row(arg0, bat, junkb, rowb); + if (ret) + return ret; + for (i = 0; i < TP_CONTROLLER_ROW_LEN; i++) { + if (rowa[i] == junka && rowb[i] == junkb) + p += sprintf(p, "-- "); /* unused by EC */ + else + p += sprintf(p, "%02x ", rowa[i]); + } + p += sprintf(p, "\n"); + } + return p-buf; +} + + +/********************************************************************* + * sysfs attribute I/O, other than batteries + */ + +static ssize_t show_ac_connected( + struct device *dev, struct device_attribute *attr, char *buf) +{ + int ret = power_device_present(0xFF); + if (ret < 0) + return ret; + return sprintf(buf, "%d\n", ret); /* type: boolean */ +} + +/********************************************************************* + * The the "smapi_request" sysfs attribute executes a raw SMAPI call. + * You write to make a request and read to get the result. The state + * is saved globally rather than per fd (sysfs limitation), so + * simultaenous requests may get each other's results! So this is for + * development and debugging only. + */ +#define MAX_SMAPI_ATTR_ANSWER_LEN 128 +static char smapi_attr_answer[MAX_SMAPI_ATTR_ANSWER_LEN] = ""; + +static ssize_t show_smapi_request(struct device *dev, + struct device_attribute *attr, char *buf) +{ + int ret = snprintf(buf, PAGE_SIZE, "%s", smapi_attr_answer); + smapi_attr_answer[0] = '\0'; + return ret; +} + +static ssize_t store_smapi_request(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + unsigned int inEBX, inECX, inEDI, inESI; + u32 outEBX, outECX, outEDX, outEDI, outESI; + const char *msg; + int ret; + if (sscanf(buf, "%x %x %x %x", &inEBX, &inECX, &inEDI, &inESI) != 4) { + smapi_attr_answer[0] = '\0'; + return -EINVAL; + } + ret = smapi_request( + inEBX, inECX, inEDI, inESI, + &outEBX, &outECX, &outEDX, &outEDI, &outESI, &msg); + snprintf(smapi_attr_answer, MAX_SMAPI_ATTR_ANSWER_LEN, + "%x %x %x %x %x %d '%s'\n", + (unsigned int)outEBX, (unsigned int)outECX, + (unsigned int)outEDX, (unsigned int)outEDI, + (unsigned int)outESI, ret, msg); + if (ret) + return ret; + else + return count; +} + +/********************************************************************* + * Power management: the embedded controller forgets the battery + * thresholds when the system is suspended to disk and unplugged from + * AC and battery, so we restore it upon resume. + */ + +static int saved_threshs[4] = {-1, -1, -1, -1}; /* -1 = don't know */ + +static int tp_suspend(struct platform_device *dev, pm_message_t state) +{ + int restore = (state.event == PM_EVENT_HIBERNATE || + state.event == PM_EVENT_FREEZE); + if (!restore || get_real_thresh(0, THRESH_STOP , &saved_threshs[0])) + saved_threshs[0] = -1; + if (!restore || get_real_thresh(0, THRESH_START, &saved_threshs[1])) + saved_threshs[1] = -1; + if (!restore || get_real_thresh(1, THRESH_STOP , &saved_threshs[2])) + saved_threshs[2] = -1; + if (!restore || get_real_thresh(1, THRESH_START, &saved_threshs[3])) + saved_threshs[3] = -1; + DPRINTK("suspend saved: %d %d %d %d", saved_threshs[0], + saved_threshs[1], saved_threshs[2], saved_threshs[3]); + return 0; +} + +static int tp_resume(struct platform_device *dev) +{ + DPRINTK("resume restoring: %d %d %d %d", saved_threshs[0], + saved_threshs[1], saved_threshs[2], saved_threshs[3]); + if (saved_threshs[0] >= 0) + set_real_thresh(0, THRESH_STOP , saved_threshs[0]); + if (saved_threshs[1] >= 0) + set_real_thresh(0, THRESH_START, saved_threshs[1]); + if (saved_threshs[2] >= 0) + set_real_thresh(1, THRESH_STOP , saved_threshs[2]); + if (saved_threshs[3] >= 0) + set_real_thresh(1, THRESH_START, saved_threshs[3]); + return 0; +} + + +/********************************************************************* + * Driver model + */ + +static struct platform_driver tp_driver = { + .suspend = tp_suspend, + .resume = tp_resume, + .driver = { + .name = "smapi", + .owner = THIS_MODULE + }, +}; + + +/********************************************************************* + * Sysfs device model + */ + +/* Attributes in /sys/devices/platform/smapi/ */ + +static DEVICE_ATTR(ac_connected, 0444, show_ac_connected, NULL); +static DEVICE_ATTR(smapi_request, 0600, show_smapi_request, + store_smapi_request); + +static struct attribute *tp_root_attributes[] = { + &dev_attr_ac_connected.attr, + &dev_attr_smapi_request.attr, + NULL +}; +static struct attribute_group tp_root_attribute_group = { + .attrs = tp_root_attributes +}; + +/* Attributes under /sys/devices/platform/smapi/BAT{0,1}/ : + * Every attribute needs to be defined (i.e., statically allocated) for + * each battery, and then referenced in the attribute list of each battery. + * We use preprocessor voodoo to avoid duplicating the list of attributes 4 + * times. The preprocessor output is just normal sysfs attributes code. + */ + +/** + * FOREACH_BAT_ATTR - invoke the given macros on all our battery attributes + * @_BAT: battery number (0 or 1) + * @_ATTR_RW: macro to invoke for each read/write attribute + * @_ATTR_R: macro to invoke for each read-only attribute + */ +#define FOREACH_BAT_ATTR(_BAT, _ATTR_RW, _ATTR_R) \ + _ATTR_RW(_BAT, start_charge_thresh) \ + _ATTR_RW(_BAT, stop_charge_thresh) \ + _ATTR_RW(_BAT, inhibit_charge_minutes) \ + _ATTR_RW(_BAT, force_discharge) \ + _ATTR_R(_BAT, installed) \ + _ATTR_R(_BAT, state) \ + _ATTR_R(_BAT, manufacturer) \ + _ATTR_R(_BAT, model) \ + _ATTR_R(_BAT, barcoding) \ + _ATTR_R(_BAT, chemistry) \ + _ATTR_R(_BAT, voltage) \ + _ATTR_R(_BAT, group0_voltage) \ + _ATTR_R(_BAT, group1_voltage) \ + _ATTR_R(_BAT, group2_voltage) \ + _ATTR_R(_BAT, group3_voltage) \ + _ATTR_R(_BAT, current_now) \ + _ATTR_R(_BAT, current_avg) \ + _ATTR_R(_BAT, charging_max_current) \ + _ATTR_R(_BAT, power_now) \ + _ATTR_R(_BAT, power_avg) \ + _ATTR_R(_BAT, remaining_percent) \ + _ATTR_R(_BAT, remaining_percent_error) \ + _ATTR_R(_BAT, remaining_charging_time) \ + _ATTR_R(_BAT, remaining_running_time) \ + _ATTR_R(_BAT, remaining_running_time_now) \ + _ATTR_R(_BAT, remaining_capacity) \ + _ATTR_R(_BAT, last_full_capacity) \ + _ATTR_R(_BAT, design_voltage) \ + _ATTR_R(_BAT, charging_max_voltage) \ + _ATTR_R(_BAT, design_capacity) \ + _ATTR_R(_BAT, cycle_count) \ + _ATTR_R(_BAT, temperature) \ + _ATTR_R(_BAT, serial) \ + _ATTR_R(_BAT, manufacture_date) \ + _ATTR_R(_BAT, first_use_date) \ + _ATTR_R(_BAT, dump) + +/* Define several macros we will feed into FOREACH_BAT_ATTR: */ + +#define DEFINE_BAT_ATTR_RW(_BAT,_NAME) \ + static struct bat_device_attribute dev_attr_##_NAME##_##_BAT = { \ + .dev_attr = __ATTR(_NAME, 0644, show_battery_##_NAME, \ + store_battery_##_NAME), \ + .bat = _BAT \ + }; + +#define DEFINE_BAT_ATTR_R(_BAT,_NAME) \ + static struct bat_device_attribute dev_attr_##_NAME##_##_BAT = { \ + .dev_attr = __ATTR(_NAME, 0644, show_battery_##_NAME, 0), \ + .bat = _BAT \ + }; + +#define REF_BAT_ATTR(_BAT,_NAME) \ + &dev_attr_##_NAME##_##_BAT.dev_attr.attr, + +/* This provide all attributes for one battery: */ + +#define PROVIDE_BAT_ATTRS(_BAT) \ + FOREACH_BAT_ATTR(_BAT, DEFINE_BAT_ATTR_RW, DEFINE_BAT_ATTR_R) \ + static struct attribute *tp_bat##_BAT##_attributes[] = { \ + FOREACH_BAT_ATTR(_BAT, REF_BAT_ATTR, REF_BAT_ATTR) \ + NULL \ + }; \ + static struct attribute_group tp_bat##_BAT##_attribute_group = { \ + .name = "BAT" #_BAT, \ + .attrs = tp_bat##_BAT##_attributes \ + }; + +/* Finally genereate the attributes: */ + +PROVIDE_BAT_ATTRS(0) +PROVIDE_BAT_ATTRS(1) + +/* List of attribute groups */ + +static struct attribute_group *attr_groups[] = { + &tp_root_attribute_group, + &tp_bat0_attribute_group, + &tp_bat1_attribute_group, + NULL +}; + + +/********************************************************************* + * Init and cleanup + */ + +static struct attribute_group **next_attr_group; /* next to register */ + +static int __init tp_init(void) +{ + int ret; + printk(KERN_INFO "tp_smapi " TP_VERSION " loading...\n"); + + ret = find_smapi_port(); + if (ret < 0) + goto err; + else + smapi_port = ret; + + if (!request_region(smapi_port, 1, "smapi")) { + printk(KERN_ERR "tp_smapi cannot claim port 0x%x\n", + smapi_port); + ret = -ENXIO; + goto err; + } + + if (!request_region(SMAPI_PORT2, 1, "smapi")) { + printk(KERN_ERR "tp_smapi cannot claim port 0x%x\n", + SMAPI_PORT2); + ret = -ENXIO; + goto err_port1; + } + + ret = platform_driver_register(&tp_driver); + if (ret) + goto err_port2; + + pdev = platform_device_alloc("smapi", -1); + if (!pdev) { + ret = -ENOMEM; + goto err_driver; + } + + ret = platform_device_add(pdev); + if (ret) + goto err_device_free; + + for (next_attr_group = attr_groups; *next_attr_group; + ++next_attr_group) { + ret = sysfs_create_group(&pdev->dev.kobj, *next_attr_group); + if (ret) + goto err_attr; + } + + printk(KERN_INFO "tp_smapi successfully loaded (smapi_port=0x%x).\n", + smapi_port); + return 0; + +err_attr: + while (--next_attr_group >= attr_groups) + sysfs_remove_group(&pdev->dev.kobj, *next_attr_group); + platform_device_unregister(pdev); +err_device_free: + platform_device_put(pdev); +err_driver: + platform_driver_unregister(&tp_driver); +err_port2: + release_region(SMAPI_PORT2, 1); +err_port1: + release_region(smapi_port, 1); +err: + printk(KERN_ERR "tp_smapi init failed (ret=%d)!\n", ret); + return ret; +} + +static void __exit tp_exit(void) +{ + while (next_attr_group && --next_attr_group >= attr_groups) + sysfs_remove_group(&pdev->dev.kobj, *next_attr_group); + platform_device_unregister(pdev); + platform_driver_unregister(&tp_driver); + release_region(SMAPI_PORT2, 1); + if (smapi_port) + release_region(smapi_port, 1); + + printk(KERN_INFO "tp_smapi unloaded.\n"); +} + +module_init(tp_init); +module_exit(tp_exit); diff --git a/drivers/staging/Kconfig b/drivers/staging/Kconfig index 5d3b86a..7f27a43 100644 --- a/drivers/staging/Kconfig +++ b/drivers/staging/Kconfig @@ -110,4 +110,6 @@ source "drivers/staging/wilc1000/Kconfig" source "drivers/staging/most/Kconfig" +source "drivers/staging/vhba/Kconfig" + endif # STAGING diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile index 30918ed..3d47777 100644 --- a/drivers/staging/Makefile +++ b/drivers/staging/Makefile @@ -24,6 +24,7 @@ obj-$(CONFIG_VME_BUS) += vme/ obj-$(CONFIG_IIO) += iio/ obj-$(CONFIG_FB_SM750) += sm750fb/ obj-$(CONFIG_FB_XGI) += xgifb/ +obj-$(CONFIG_VHBA) += vhba/ obj-$(CONFIG_USB_EMXX) += emxx_udc/ obj-$(CONFIG_SPEAKUP) += speakup/ obj-$(CONFIG_TOUCHSCREEN_SYNAPTICS_I2C_RMI4) += ste_rmi4/ diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c index 5bb1283..f4f8f03 100644 --- a/drivers/staging/android/ashmem.c +++ b/drivers/staging/android/ashmem.c @@ -387,7 +387,7 @@ static int ashmem_mmap(struct file *file, struct vm_area_struct *vma) name = asma->name; /* ... and allocate the backing shmem file */ - vmfile = shmem_file_setup(name, asma->size, vma->vm_flags); + vmfile = shmem_file_setup(name, asma->size, vma->vm_flags, 0); if (IS_ERR(vmfile)) { ret = PTR_ERR(vmfile); goto out; diff --git b/drivers/staging/vhba/Kconfig b/drivers/staging/vhba/Kconfig new file mode 100644 index 0000000..7ccb7d8 --- /dev/null +++ b/drivers/staging/vhba/Kconfig @@ -0,0 +1,9 @@ +config VHBA + tristate "Virtual (SCSI) Host Bus Adapter" + depends on SCSI + ---help--- + This is the in-kernel part of CDEmu, a CD/DVD-ROM device + emulator. + + This driver can also be built as a module. If so, the module + will be called vhba. diff --git b/drivers/staging/vhba/Makefile b/drivers/staging/vhba/Makefile new file mode 100644 index 0000000..60b9e26 --- /dev/null +++ b/drivers/staging/vhba/Makefile @@ -0,0 +1,4 @@ +VHBA_VERSION := 20140928 + +obj-$(CONFIG_VHBA) += vhba.o +ccflags-y := -DVHBA_VERSION=\"$(VHBA_VERSION)\" -Werror diff --git b/drivers/staging/vhba/vhba.c b/drivers/staging/vhba/vhba.c new file mode 100644 index 0000000..4a14a10 --- /dev/null +++ b/drivers/staging/vhba/vhba.c @@ -0,0 +1,1071 @@ +/* + * vhba.c + * + * Copyright (C) 2007-2012 Chia-I Wu + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License along + * with this program; if not, write to the Free Software Foundation, Inc., + * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#ifdef CONFIG_COMPAT +#include +#endif +#include +#include +#include +#include +#include + +/* scatterlist.page_link and sg_page() were introduced in 2.6.24 */ +#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 24) +#define USE_SG_PAGE +#include +#endif + +MODULE_AUTHOR("Chia-I Wu"); +MODULE_VERSION(VHBA_VERSION); +MODULE_DESCRIPTION("Virtual SCSI HBA"); +MODULE_LICENSE("GPL"); + +#ifdef DEBUG +#define DPRINTK(fmt, args...) printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args) +#else +#define DPRINTK(fmt, args...) +#endif + +/* scmd_dbg was introduced in 3.15 */ +#ifndef scmd_dbg +#define scmd_dbg(scmd, fmt, a...) \ + dev_dbg(&(scmd)->device->sdev_gendev, fmt, ##a) +#endif + +#ifndef scmd_warn +#define scmd_warn(scmd, fmt, a...) \ + dev_warn(&(scmd)->device->sdev_gendev, fmt, ##a) +#endif + +#define VHBA_MAX_SECTORS_PER_IO 256 +#define VHBA_MAX_ID 32 +#define VHBA_CAN_QUEUE 32 +#define VHBA_INVALID_ID VHBA_MAX_ID + +#define DATA_TO_DEVICE(dir) ((dir) == DMA_TO_DEVICE || (dir) == DMA_BIDIRECTIONAL) +#define DATA_FROM_DEVICE(dir) ((dir) == DMA_FROM_DEVICE || (dir) == DMA_BIDIRECTIONAL) + + +/* SCSI macros were introduced in 2.6.23 */ +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 23) +#define scsi_sg_count(cmd) ((cmd)->use_sg) +#define scsi_sglist(cmd) ((cmd)->request_buffer) +#define scsi_bufflen(cmd) ((cmd)->request_bufflen) +#define scsi_set_resid(cmd, to_read) {(cmd)->resid = (to_read);} +#endif + +/* 1-argument form of k[un]map_atomic was introduced in 2.6.37-rc1; + 2-argument form was deprecated in 3.4-rc1 */ +#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 37) +#define vhba_kmap_atomic kmap_atomic +#define vhba_kunmap_atomic kunmap_atomic +#else +#define vhba_kmap_atomic(page) kmap_atomic(page, KM_USER0) +#define vhba_kunmap_atomic(page) kunmap_atomic(page, KM_USER0) +#endif + + +enum vhba_req_state { + VHBA_REQ_FREE, + VHBA_REQ_PENDING, + VHBA_REQ_READING, + VHBA_REQ_SENT, + VHBA_REQ_WRITING, +}; + +struct vhba_command { + struct scsi_cmnd *cmd; + int status; + struct list_head entry; +}; + +struct vhba_device { + uint id; + spinlock_t cmd_lock; + struct list_head cmd_list; + wait_queue_head_t cmd_wq; + atomic_t refcnt; +}; + +struct vhba_host { + struct Scsi_Host *shost; + spinlock_t cmd_lock; + int cmd_next; + struct vhba_command commands[VHBA_CAN_QUEUE]; + spinlock_t dev_lock; + struct vhba_device *devices[VHBA_MAX_ID]; + int num_devices; + DECLARE_BITMAP(chgmap, VHBA_MAX_ID); + int chgtype[VHBA_MAX_ID]; + struct work_struct scan_devices; +}; + +#define MAX_COMMAND_SIZE 16 + +struct vhba_request { + __u32 tag; + __u32 lun; + __u8 cdb[MAX_COMMAND_SIZE]; + __u8 cdb_len; + __u32 data_len; +}; + +struct vhba_response { + __u32 tag; + __u32 status; + __u32 data_len; +}; + +static struct vhba_command *vhba_alloc_command (void); +static void vhba_free_command (struct vhba_command *vcmd); + +static struct platform_device vhba_platform_device; + +static struct vhba_device *vhba_device_alloc (void) +{ + struct vhba_device *vdev; + + vdev = kzalloc(sizeof(struct vhba_device), GFP_KERNEL); + if (!vdev) { + return NULL; + } + + vdev->id = VHBA_INVALID_ID; + spin_lock_init(&vdev->cmd_lock); + INIT_LIST_HEAD(&vdev->cmd_list); + init_waitqueue_head(&vdev->cmd_wq); + atomic_set(&vdev->refcnt, 1); + + return vdev; +} + +static void vhba_device_put (struct vhba_device *vdev) +{ + if (atomic_dec_and_test(&vdev->refcnt)) { + kfree(vdev); + } +} + +static struct vhba_device *vhba_device_get (struct vhba_device *vdev) +{ + atomic_inc(&vdev->refcnt); + + return vdev; +} + +static int vhba_device_queue (struct vhba_device *vdev, struct scsi_cmnd *cmd) +{ + struct vhba_command *vcmd; + unsigned long flags; + + vcmd = vhba_alloc_command(); + if (!vcmd) { + return SCSI_MLQUEUE_HOST_BUSY; + } + + vcmd->cmd = cmd; + + spin_lock_irqsave(&vdev->cmd_lock, flags); + list_add_tail(&vcmd->entry, &vdev->cmd_list); + spin_unlock_irqrestore(&vdev->cmd_lock, flags); + + wake_up_interruptible(&vdev->cmd_wq); + + return 0; +} + +static int vhba_device_dequeue (struct vhba_device *vdev, struct scsi_cmnd *cmd) +{ + struct vhba_command *vcmd; + int retval; + unsigned long flags; + + spin_lock_irqsave(&vdev->cmd_lock, flags); + list_for_each_entry(vcmd, &vdev->cmd_list, entry) { + if (vcmd->cmd == cmd) { + list_del_init(&vcmd->entry); + break; + } + } + + /* command not found */ + if (&vcmd->entry == &vdev->cmd_list) { + spin_unlock_irqrestore(&vdev->cmd_lock, flags); + return SUCCESS; + } + + while (vcmd->status == VHBA_REQ_READING || vcmd->status == VHBA_REQ_WRITING) { + spin_unlock_irqrestore(&vdev->cmd_lock, flags); + scmd_dbg(cmd, "wait for I/O before aborting\n"); + schedule_timeout(1); + spin_lock_irqsave(&vdev->cmd_lock, flags); + } + + retval = (vcmd->status == VHBA_REQ_SENT) ? FAILED : SUCCESS; + + vhba_free_command(vcmd); + + spin_unlock_irqrestore(&vdev->cmd_lock, flags); + + return retval; +} + +static inline void vhba_scan_devices_add (struct vhba_host *vhost, int id) +{ + struct scsi_device *sdev; + + sdev = scsi_device_lookup(vhost->shost, 0, id, 0); + if (!sdev) { + scsi_add_device(vhost->shost, 0, id, 0); + } else { + dev_warn(&vhost->shost->shost_gendev, "tried to add an already-existing device 0:%d:0!\n", id); + scsi_device_put(sdev); + } +} + +static inline void vhba_scan_devices_remove (struct vhba_host *vhost, int id) +{ + struct scsi_device *sdev; + + sdev = scsi_device_lookup(vhost->shost, 0, id, 0); + if (sdev) { + scsi_remove_device(sdev); + scsi_device_put(sdev); + } else { + dev_warn(&vhost->shost->shost_gendev, "tried to remove non-existing device 0:%d:0!\n", id); + } +} + +static void vhba_scan_devices (struct work_struct *work) +{ + struct vhba_host *vhost = container_of(work, struct vhba_host, scan_devices); + unsigned long flags; + int id, change, exists; + + while (1) { + spin_lock_irqsave(&vhost->dev_lock, flags); + + id = find_first_bit(vhost->chgmap, vhost->shost->max_id); + if (id >= vhost->shost->max_id) { + spin_unlock_irqrestore(&vhost->dev_lock, flags); + break; + } + change = vhost->chgtype[id]; + exists = vhost->devices[id] != NULL; + + vhost->chgtype[id] = 0; + clear_bit(id, vhost->chgmap); + + spin_unlock_irqrestore(&vhost->dev_lock, flags); + + if (change < 0) { + dev_dbg(&vhost->shost->shost_gendev, "trying to remove target 0:%d:0\n", id); + vhba_scan_devices_remove(vhost, id); + } else if (change > 0) { + dev_dbg(&vhost->shost->shost_gendev, "trying to add target 0:%d:0\n", id); + vhba_scan_devices_add(vhost, id); + } else { + /* quick sequence of add/remove or remove/add; we determine + which one it was by checking if device structure exists */ + if (exists) { + /* remove followed by add: remove and (re)add */ + dev_dbg(&vhost->shost->shost_gendev, "trying to (re)add target 0:%d:0\n", id); + vhba_scan_devices_remove(vhost, id); + vhba_scan_devices_add(vhost, id); + } else { + /* add followed by remove: no-op */ + dev_dbg(&vhost->shost->shost_gendev, "no-op for target 0:%d:0\n", id); + } + } + } +} + +static int vhba_add_device (struct vhba_device *vdev) +{ + struct vhba_host *vhost; + int i; + unsigned long flags; + + vhost = platform_get_drvdata(&vhba_platform_device); + + vhba_device_get(vdev); + + spin_lock_irqsave(&vhost->dev_lock, flags); + if (vhost->num_devices >= vhost->shost->max_id) { + spin_unlock_irqrestore(&vhost->dev_lock, flags); + vhba_device_put(vdev); + return -EBUSY; + } + + for (i = 0; i < vhost->shost->max_id; i++) { + if (vhost->devices[i] == NULL) { + vdev->id = i; + vhost->devices[i] = vdev; + vhost->num_devices++; + set_bit(vdev->id, vhost->chgmap); + vhost->chgtype[vdev->id]++; + break; + } + } + spin_unlock_irqrestore(&vhost->dev_lock, flags); + + schedule_work(&vhost->scan_devices); + + return 0; +} + +static int vhba_remove_device (struct vhba_device *vdev) +{ + struct vhba_host *vhost; + unsigned long flags; + + vhost = platform_get_drvdata(&vhba_platform_device); + + spin_lock_irqsave(&vhost->dev_lock, flags); + set_bit(vdev->id, vhost->chgmap); + vhost->chgtype[vdev->id]--; + vhost->devices[vdev->id] = NULL; + vhost->num_devices--; + vdev->id = VHBA_INVALID_ID; + spin_unlock_irqrestore(&vhost->dev_lock, flags); + + vhba_device_put(vdev); + + schedule_work(&vhost->scan_devices); + + return 0; +} + +static struct vhba_device *vhba_lookup_device (int id) +{ + struct vhba_host *vhost; + struct vhba_device *vdev = NULL; + unsigned long flags; + + vhost = platform_get_drvdata(&vhba_platform_device); + + if (likely(id < vhost->shost->max_id)) { + spin_lock_irqsave(&vhost->dev_lock, flags); + vdev = vhost->devices[id]; + if (vdev) { + vdev = vhba_device_get(vdev); + } + + spin_unlock_irqrestore(&vhost->dev_lock, flags); + } + + return vdev; +} + +static struct vhba_command *vhba_alloc_command (void) +{ + struct vhba_host *vhost; + struct vhba_command *vcmd; + unsigned long flags; + int i; + + vhost = platform_get_drvdata(&vhba_platform_device); + + spin_lock_irqsave(&vhost->cmd_lock, flags); + + vcmd = vhost->commands + vhost->cmd_next++; + if (vcmd->status != VHBA_REQ_FREE) { + for (i = 0; i < vhost->shost->can_queue; i++) { + vcmd = vhost->commands + i; + + if (vcmd->status == VHBA_REQ_FREE) { + vhost->cmd_next = i + 1; + break; + } + } + + if (i == vhost->shost->can_queue) { + vcmd = NULL; + } + } + + if (vcmd) { + vcmd->status = VHBA_REQ_PENDING; + } + + vhost->cmd_next %= vhost->shost->can_queue; + + spin_unlock_irqrestore(&vhost->cmd_lock, flags); + + return vcmd; +} + +static void vhba_free_command (struct vhba_command *vcmd) +{ + struct vhba_host *vhost; + unsigned long flags; + + vhost = platform_get_drvdata(&vhba_platform_device); + + spin_lock_irqsave(&vhost->cmd_lock, flags); + vcmd->status = VHBA_REQ_FREE; + spin_unlock_irqrestore(&vhost->cmd_lock, flags); +} + +static int vhba_queuecommand_lck (struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd *)) +{ + struct vhba_device *vdev; + int retval; + + scmd_dbg(cmd, "queue %lu\n", cmd->serial_number); + + vdev = vhba_lookup_device(cmd->device->id); + if (!vdev) { + scmd_dbg(cmd, "no such device\n"); + + cmd->result = DID_NO_CONNECT << 16; + done(cmd); + + return 0; + } + + cmd->scsi_done = done; + retval = vhba_device_queue(vdev, cmd); + + vhba_device_put(vdev); + + return retval; +} + +#ifdef DEF_SCSI_QCMD +DEF_SCSI_QCMD(vhba_queuecommand) +#else +#define vhba_queuecommand vhba_queuecommand_lck +#endif + +static int vhba_abort (struct scsi_cmnd *cmd) +{ + struct vhba_device *vdev; + int retval = SUCCESS; + + scmd_warn(cmd, "abort %lu\n", cmd->serial_number); + + vdev = vhba_lookup_device(cmd->device->id); + if (vdev) { + retval = vhba_device_dequeue(vdev, cmd); + vhba_device_put(vdev); + } else { + cmd->result = DID_NO_CONNECT << 16; + } + + return retval; +} + +static struct scsi_host_template vhba_template = { + .module = THIS_MODULE, + .name = "vhba", + .proc_name = "vhba", + .queuecommand = vhba_queuecommand, + .eh_abort_handler = vhba_abort, + .can_queue = VHBA_CAN_QUEUE, + .this_id = -1, + .cmd_per_lun = 1, + .max_sectors = VHBA_MAX_SECTORS_PER_IO, + .sg_tablesize = 256, +}; + +static ssize_t do_request (struct scsi_cmnd *cmd, char __user *buf, size_t buf_len) +{ + struct vhba_request vreq; + ssize_t ret; + + scmd_dbg(cmd, "request %lu, cdb 0x%x, bufflen %d, use_sg %d\n", + cmd->serial_number, cmd->cmnd[0], scsi_bufflen(cmd), scsi_sg_count(cmd)); + + ret = sizeof(vreq); + if (DATA_TO_DEVICE(cmd->sc_data_direction)) { + ret += scsi_bufflen(cmd); + } + + if (ret > buf_len) { + scmd_warn(cmd, "buffer too small (%zd < %zd) for a request\n", buf_len, ret); + return -EIO; + } + + vreq.tag = cmd->serial_number; + vreq.lun = cmd->device->lun; + memcpy(vreq.cdb, cmd->cmnd, MAX_COMMAND_SIZE); + vreq.cdb_len = cmd->cmd_len; + vreq.data_len = scsi_bufflen(cmd); + + if (copy_to_user(buf, &vreq, sizeof(vreq))) { + return -EFAULT; + } + + if (DATA_TO_DEVICE(cmd->sc_data_direction) && vreq.data_len) { + buf += sizeof(vreq); + + if (scsi_sg_count(cmd)) { + unsigned char buf_stack[64]; + unsigned char *kaddr, *uaddr, *kbuf; + struct scatterlist *sg = scsi_sglist(cmd); + int i; + + uaddr = (unsigned char *) buf; + + if (vreq.data_len > 64) { + kbuf = kmalloc(PAGE_SIZE, GFP_KERNEL); + } else { + kbuf = buf_stack; + } + + for (i = 0; i < scsi_sg_count(cmd); i++) { + size_t len = sg[i].length; + +#ifdef USE_SG_PAGE + kaddr = vhba_kmap_atomic(sg_page(&sg[i])); +#else + kaddr = vhba_kmap_atomic(sg[i].page); +#endif + memcpy(kbuf, kaddr + sg[i].offset, len); + vhba_kunmap_atomic(kaddr); + + if (copy_to_user(uaddr, kbuf, len)) { + if (kbuf != buf_stack) { + kfree(kbuf); + } + return -EFAULT; + } + uaddr += len; + } + + if (kbuf != buf_stack) { + kfree(kbuf); + } + } else { + if (copy_to_user(buf, scsi_sglist(cmd), vreq.data_len)) { + return -EFAULT; + } + } + } + + return ret; +} + +static ssize_t do_response (struct scsi_cmnd *cmd, const char __user *buf, size_t buf_len, struct vhba_response *res) +{ + ssize_t ret = 0; + + scmd_dbg(cmd, "response %lu, status %x, data len %d, use_sg %d\n", + cmd->serial_number, res->status, res->data_len, scsi_sg_count(cmd)); + + if (res->status) { + unsigned char sense_stack[SCSI_SENSE_BUFFERSIZE]; + + if (res->data_len > SCSI_SENSE_BUFFERSIZE) { + scmd_warn(cmd, "truncate sense (%d < %d)", SCSI_SENSE_BUFFERSIZE, res->data_len); + res->data_len = SCSI_SENSE_BUFFERSIZE; + } + + /* Copy via temporary buffer on stack in order to avoid problems + with PAX on grsecurity-enabled kernels */ + if (copy_from_user(sense_stack, buf, res->data_len)) { + return -EFAULT; + } + memcpy(cmd->sense_buffer, sense_stack, res->data_len); + + cmd->result = res->status; + + ret += res->data_len; + } else if (DATA_FROM_DEVICE(cmd->sc_data_direction) && scsi_bufflen(cmd)) { + size_t to_read; + + if (res->data_len > scsi_bufflen(cmd)) { + scmd_warn(cmd, "truncate data (%d < %d)\n", scsi_bufflen(cmd), res->data_len); + res->data_len = scsi_bufflen(cmd); + } + + to_read = res->data_len; + + if (scsi_sg_count(cmd)) { + unsigned char buf_stack[64]; + unsigned char *kaddr, *uaddr, *kbuf; + struct scatterlist *sg = scsi_sglist(cmd); + int i; + + uaddr = (unsigned char *)buf; + + if (res->data_len > 64) { + kbuf = kmalloc(PAGE_SIZE, GFP_KERNEL); + } else { + kbuf = buf_stack; + } + + for (i = 0; i < scsi_sg_count(cmd); i++) { + size_t len = (sg[i].length < to_read) ? sg[i].length : to_read; + + if (copy_from_user(kbuf, uaddr, len)) { + if (kbuf != buf_stack) { + kfree(kbuf); + } + return -EFAULT; + } + uaddr += len; + +#ifdef USE_SG_PAGE + kaddr = vhba_kmap_atomic(sg_page(&sg[i])); +#else + kaddr = vhba_kmap_atomic(sg[i].page); +#endif + memcpy(kaddr + sg[i].offset, kbuf, len); + vhba_kunmap_atomic(kaddr); + + to_read -= len; + if (to_read == 0) { + break; + } + } + + if (kbuf != buf_stack) { + kfree(kbuf); + } + } else { + if (copy_from_user(scsi_sglist(cmd), buf, res->data_len)) { + return -EFAULT; + } + + to_read -= res->data_len; + } + + scsi_set_resid(cmd, to_read); + + ret += res->data_len - to_read; + } + + return ret; +} + +static inline struct vhba_command *next_command (struct vhba_device *vdev) +{ + struct vhba_command *vcmd; + + list_for_each_entry(vcmd, &vdev->cmd_list, entry) { + if (vcmd->status == VHBA_REQ_PENDING) { + break; + } + } + + if (&vcmd->entry == &vdev->cmd_list) { + vcmd = NULL; + } + + return vcmd; +} + +static inline struct vhba_command *match_command (struct vhba_device *vdev, u32 tag) +{ + struct vhba_command *vcmd; + + list_for_each_entry(vcmd, &vdev->cmd_list, entry) { + if (vcmd->cmd->serial_number == tag) { + break; + } + } + + if (&vcmd->entry == &vdev->cmd_list) { + vcmd = NULL; + } + + return vcmd; +} + +static struct vhba_command *wait_command (struct vhba_device *vdev, unsigned long flags) +{ + struct vhba_command *vcmd; + DEFINE_WAIT(wait); + + while (!(vcmd = next_command(vdev))) { + if (signal_pending(current)) { + break; + } + + prepare_to_wait(&vdev->cmd_wq, &wait, TASK_INTERRUPTIBLE); + + spin_unlock_irqrestore(&vdev->cmd_lock, flags); + + schedule(); + + spin_lock_irqsave(&vdev->cmd_lock, flags); + } + + finish_wait(&vdev->cmd_wq, &wait); + if (vcmd) { + vcmd->status = VHBA_REQ_READING; + } + + return vcmd; +} + +static ssize_t vhba_ctl_read (struct file *file, char __user *buf, size_t buf_len, loff_t *offset) +{ + struct vhba_device *vdev; + struct vhba_command *vcmd; + ssize_t ret; + unsigned long flags; + + vdev = file->private_data; + + /* Get next command */ + if (file->f_flags & O_NONBLOCK) { + /* Non-blocking variant */ + spin_lock_irqsave(&vdev->cmd_lock, flags); + vcmd = next_command(vdev); + spin_unlock_irqrestore(&vdev->cmd_lock, flags); + + if (!vcmd) { + return -EWOULDBLOCK; + } + } else { + /* Blocking variant */ + spin_lock_irqsave(&vdev->cmd_lock, flags); + vcmd = wait_command(vdev, flags); + spin_unlock_irqrestore(&vdev->cmd_lock, flags); + + if (!vcmd) { + return -ERESTARTSYS; + } + } + + ret = do_request(vcmd->cmd, buf, buf_len); + + spin_lock_irqsave(&vdev->cmd_lock, flags); + if (ret >= 0) { + vcmd->status = VHBA_REQ_SENT; + *offset += ret; + } else { + vcmd->status = VHBA_REQ_PENDING; + } + + spin_unlock_irqrestore(&vdev->cmd_lock, flags); + + return ret; +} + +static ssize_t vhba_ctl_write (struct file *file, const char __user *buf, size_t buf_len, loff_t *offset) +{ + struct vhba_device *vdev; + struct vhba_command *vcmd; + struct vhba_response res; + ssize_t ret; + unsigned long flags; + + if (buf_len < sizeof(res)) { + return -EIO; + } + + if (copy_from_user(&res, buf, sizeof(res))) { + return -EFAULT; + } + + vdev = file->private_data; + + spin_lock_irqsave(&vdev->cmd_lock, flags); + vcmd = match_command(vdev, res.tag); + if (!vcmd || vcmd->status != VHBA_REQ_SENT) { + spin_unlock_irqrestore(&vdev->cmd_lock, flags); + DPRINTK("not expecting response\n"); + return -EIO; + } + vcmd->status = VHBA_REQ_WRITING; + spin_unlock_irqrestore(&vdev->cmd_lock, flags); + + ret = do_response(vcmd->cmd, buf + sizeof(res), buf_len - sizeof(res), &res); + + spin_lock_irqsave(&vdev->cmd_lock, flags); + if (ret >= 0) { + vcmd->cmd->scsi_done(vcmd->cmd); + ret += sizeof(res); + + /* don't compete with vhba_device_dequeue */ + if (!list_empty(&vcmd->entry)) { + list_del_init(&vcmd->entry); + vhba_free_command(vcmd); + } + } else { + vcmd->status = VHBA_REQ_SENT; + } + + spin_unlock_irqrestore(&vdev->cmd_lock, flags); + + return ret; +} + +static long vhba_ctl_ioctl (struct file *file, unsigned int cmd, unsigned long arg) +{ + struct vhba_device *vdev = file->private_data; + struct vhba_host *vhost; + struct scsi_device *sdev; + + switch (cmd) { + case 0xBEEF001: { + vhost = platform_get_drvdata(&vhba_platform_device); + sdev = scsi_device_lookup(vhost->shost, 0, vdev->id, 0); + + if (sdev) { + int id[4] = { + sdev->host->host_no, + sdev->channel, + sdev->id, + sdev->lun + }; + + scsi_device_put(sdev); + + if (copy_to_user((void *)arg, id, sizeof(id))) { + return -EFAULT; + } + + return 0; + } else { + return -ENODEV; + } + } + } + + return -ENOTTY; +} + +#ifdef CONFIG_COMPAT +static long vhba_ctl_compat_ioctl (struct file *file, unsigned int cmd, unsigned long arg) +{ + unsigned long compat_arg = (unsigned long)compat_ptr(arg); + return vhba_ctl_ioctl(file, cmd, compat_arg); +} +#endif + +static unsigned int vhba_ctl_poll (struct file *file, poll_table *wait) +{ + struct vhba_device *vdev = file->private_data; + unsigned int mask = 0; + unsigned long flags; + + poll_wait(file, &vdev->cmd_wq, wait); + + spin_lock_irqsave(&vdev->cmd_lock, flags); + if (next_command(vdev)) { + mask |= POLLIN | POLLRDNORM; + } + spin_unlock_irqrestore(&vdev->cmd_lock, flags); + + return mask; +} + +static int vhba_ctl_open (struct inode *inode, struct file *file) +{ + struct vhba_device *vdev; + int retval; + + DPRINTK("open\n"); + + /* check if vhba is probed */ + if (!platform_get_drvdata(&vhba_platform_device)) { + return -ENODEV; + } + + vdev = vhba_device_alloc(); + if (!vdev) { + return -ENOMEM; + } + + if (!(retval = vhba_add_device(vdev))) { + file->private_data = vdev; + } + + vhba_device_put(vdev); + + return retval; +} + +static int vhba_ctl_release (struct inode *inode, struct file *file) +{ + struct vhba_device *vdev; + struct vhba_command *vcmd; + unsigned long flags; + + DPRINTK("release\n"); + + vdev = file->private_data; + + vhba_device_get(vdev); + vhba_remove_device(vdev); + + spin_lock_irqsave(&vdev->cmd_lock, flags); + list_for_each_entry(vcmd, &vdev->cmd_list, entry) { + WARN_ON(vcmd->status == VHBA_REQ_READING || vcmd->status == VHBA_REQ_WRITING); + + scmd_warn(vcmd->cmd, "device released with command %lu\n", vcmd->cmd->serial_number); + vcmd->cmd->result = DID_NO_CONNECT << 16; + vcmd->cmd->scsi_done(vcmd->cmd); + + vhba_free_command(vcmd); + } + INIT_LIST_HEAD(&vdev->cmd_list); + spin_unlock_irqrestore(&vdev->cmd_lock, flags); + + vhba_device_put(vdev); + + return 0; +} + +static struct file_operations vhba_ctl_fops = { + .owner = THIS_MODULE, + .open = vhba_ctl_open, + .release = vhba_ctl_release, + .read = vhba_ctl_read, + .write = vhba_ctl_write, + .poll = vhba_ctl_poll, + .unlocked_ioctl = vhba_ctl_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = vhba_ctl_compat_ioctl, +#endif +}; + +static struct miscdevice vhba_miscdev = { + .minor = MISC_DYNAMIC_MINOR, + .name = "vhba_ctl", + .fops = &vhba_ctl_fops, +}; + +static int vhba_probe (struct platform_device *pdev) +{ + struct Scsi_Host *shost; + struct vhba_host *vhost; + int i; + + shost = scsi_host_alloc(&vhba_template, sizeof(struct vhba_host)); + if (!shost) { + return -ENOMEM; + } + + shost->max_id = VHBA_MAX_ID; + /* we don't support lun > 0 */ + shost->max_lun = 1; + shost->max_cmd_len = MAX_COMMAND_SIZE; + + vhost = (struct vhba_host *)shost->hostdata; + memset(vhost, 0, sizeof(*vhost)); + + vhost->shost = shost; + vhost->num_devices = 0; + spin_lock_init(&vhost->dev_lock); + spin_lock_init(&vhost->cmd_lock); + INIT_WORK(&vhost->scan_devices, vhba_scan_devices); + vhost->cmd_next = 0; + for (i = 0; i < vhost->shost->can_queue; i++) { + vhost->commands[i].status = VHBA_REQ_FREE; + } + + platform_set_drvdata(pdev, vhost); + + if (scsi_add_host(shost, &pdev->dev)) { + scsi_host_put(shost); + return -ENOMEM; + } + + return 0; +} + +static int vhba_remove (struct platform_device *pdev) +{ + struct vhba_host *vhost; + struct Scsi_Host *shost; + + vhost = platform_get_drvdata(pdev); + shost = vhost->shost; + + scsi_remove_host(shost); + scsi_host_put(shost); + + return 0; +} + +static void vhba_release (struct device * dev) +{ + return; +} + +static struct platform_device vhba_platform_device = { + .name = "vhba", + .id = -1, + .dev = { + .release = vhba_release, + }, +}; + +static struct platform_driver vhba_platform_driver = { + .driver = { + .owner = THIS_MODULE, + .name = "vhba", + }, + .probe = vhba_probe, + .remove = vhba_remove, +}; + +static int __init vhba_init (void) +{ + int ret; + + ret = platform_device_register(&vhba_platform_device); + if (ret < 0) { + return ret; + } + + ret = platform_driver_register(&vhba_platform_driver); + if (ret < 0) { + platform_device_unregister(&vhba_platform_device); + return ret; + } + + ret = misc_register(&vhba_miscdev); + if (ret < 0) { + platform_driver_unregister(&vhba_platform_driver); + platform_device_unregister(&vhba_platform_device); + return ret; + } + + return 0; +} + +static void __exit vhba_exit(void) +{ + misc_deregister(&vhba_miscdev); + platform_driver_unregister(&vhba_platform_driver); + platform_device_unregister(&vhba_platform_device); +} + +module_init(vhba_init); +module_exit(vhba_exit); + diff --git a/drivers/tty/Kconfig b/drivers/tty/Kconfig index c01f450..6ba1451 100644 --- a/drivers/tty/Kconfig +++ b/drivers/tty/Kconfig @@ -75,6 +75,124 @@ config VT_CONSOLE_SLEEP def_bool y depends on VT_CONSOLE && PM_SLEEP +menuconfig VT_CKO + bool "Colored kernel message output" + depends on VT_CONSOLE + ---help--- + This option enables kernel messages to be emitted in + colors other than the default. + + The color value you need to enter is composed (OR-ed) + of a foreground and a background color. + + Foreground: + 0x00 = black, 0x08 = dark gray, + 0x01 = red, 0x09 = light red, + 0x02 = green, 0x0A = light green, + 0x03 = brown, 0x0B = yellow, + 0x04 = blue, 0x0C = light blue, + 0x05 = magenta, 0x0D = light magenta, + 0x06 = cyan, 0x0E = light cyan, + 0x07 = gray, 0x0F = white, + + (Foreground colors 0x08 to 0x0F do not work when a VGA + console font with 512 glyphs is used.) + + Background: + 0x00 = black, 0x40 = blue, + 0x10 = red, 0x50 = magenta, + 0x20 = green, 0x60 = cyan, + 0x30 = brown, 0x70 = gray, + + For example, 0x1F would yield white on red. + + If unsure, say N. + +config VT_PRINTK_EMERG_COLOR + hex "Emergency messages color" + range 0x00 0xFF + depends on VT_CKO + default 0x07 + ---help--- + This option defines with which color kernel emergency messages will + be printed to the console. + +config VT_PRINTK_ALERT_COLOR + hex "Alert messages color" + range 0x00 0xFF + depends on VT_CKO + default 0x07 + ---help--- + This option defines with which color kernel alert messages will + be printed to the console. + +config VT_PRINTK_CRIT_COLOR + hex "Critical messages color" + range 0x00 0xFF + depends on VT_CKO + default 0x07 + ---help--- + This option defines with which color kernel critical messages will + be printed to the console. + +config VT_PRINTK_ERR_COLOR + hex "Error messages color" + range 0x00 0xFF + depends on VT_CKO + default 0x07 + ---help--- + This option defines with which color kernel error messages will + be printed to the console. + +config VT_PRINTK_WARNING_COLOR + hex "Warning messages color" + range 0x00 0xFF + depends on VT_CKO + default 0x07 + ---help--- + This option defines with which color kernel warning messages will + be printed to the console. + +config VT_PRINTK_NOTICE_COLOR + hex "Notice messages color" + range 0x00 0xFF + depends on VT_CKO + default 0x07 + ---help--- + This option defines with which color kernel notice messages will + be printed to the console. + +config VT_PRINTK_INFO_COLOR + hex "Information messages color" + range 0x00 0xFF + depends on VT_CKO + default 0x07 + ---help--- + This option defines with which color kernel information messages will + be printed to the console. + +config VT_PRINTK_DEBUG_COLOR + hex "Debug messages color" + range 0x00 0xFF + depends on VT_CKO + default 0x07 + ---help--- + This option defines with which color kernel debug messages will + be printed to the console. + +config NR_TTY_DEVICES + int "Maximum tty device number" + depends on VT + range 12 63 + default 63 + ---help--- + This option is used to change the number of tty devices in /dev. + The default value is 63. The lowest number you can set is 12, + 63 is also the upper limit so we don't overrun the serial + consoles. + + If unsure, say 63. + config HW_CONSOLE bool depends on VT && !UML diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c index e46d628..3155b46 100644 --- a/drivers/tty/hvc/hvc_console.c +++ b/drivers/tty/hvc/hvc_console.c @@ -141,7 +141,7 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] = */ static void hvc_console_print(struct console *co, const char *b, - unsigned count) + unsigned count, unsigned loglevel) { char c[N_OUTBUF] __ALIGNED__; unsigned i = 0, n = 0; diff --git a/drivers/tty/hvc/hvc_xen.c b/drivers/tty/hvc/hvc_xen.c index fa816b7..f9da806 100644 --- a/drivers/tty/hvc/hvc_xen.c +++ b/drivers/tty/hvc/hvc_xen.c @@ -599,7 +599,7 @@ console_initcall(xen_cons_init); #ifdef CONFIG_EARLY_PRINTK static void xenboot_write_console(struct console *console, const char *string, - unsigned len) + unsigned len, unsigned loglevel) { unsigned int linelen, off = 0; const char *pos; diff --git a/drivers/tty/serial/8250/8250_core.c b/drivers/tty/serial/8250/8250_core.c index c9720a9..279a842 100644 --- a/drivers/tty/serial/8250/8250_core.c +++ b/drivers/tty/serial/8250/8250_core.c @@ -587,11 +587,11 @@ serial8250_register_ports(struct uart_driver *drv, struct device *dev) #ifdef CONFIG_SERIAL_8250_CONSOLE static void univ8250_console_write(struct console *co, const char *s, - unsigned int count) + unsigned int count, unsigned int loglevel) { struct uart_8250_port *up = &serial8250_ports[co->index]; - serial8250_console_write(up, s, count); + serial8250_console_write(up, s, count, loglevel); } static int univ8250_console_setup(struct console *co, char *options) diff --git a/drivers/tty/serial/8250/8250_early.c b/drivers/tty/serial/8250/8250_early.c index af62131..2aa1513 100644 --- a/drivers/tty/serial/8250/8250_early.c +++ b/drivers/tty/serial/8250/8250_early.c @@ -93,7 +93,7 @@ static void __init serial_putc(struct uart_port *port, int c) } static void __init early_serial8250_write(struct console *console, - const char *s, unsigned int count) + const char *s, unsigned int count, unsigned int loglevel) { struct earlycon_device *device = console->data; struct uart_port *port = &device->port; diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c index 8d262bc..83aed61 100644 --- a/drivers/tty/serial/8250/8250_port.c +++ b/drivers/tty/serial/8250/8250_port.c @@ -2858,7 +2858,7 @@ static void serial8250_console_restore(struct uart_8250_port *up) * The console_lock must be held when we get here. */ void serial8250_console_write(struct uart_8250_port *up, const char *s, - unsigned int count) + unsigned int count, unsigned int loglevel) { struct uart_port *port = &up->port; unsigned long flags; diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c index bd51bdd..8e550a6 100644 --- a/drivers/tty/vt/vt.c +++ b/drivers/tty/vt/vt.c @@ -71,6 +71,7 @@ */ #include +#include #include #include #include @@ -2531,16 +2532,44 @@ int vt_kmsg_redirect(int new) return kmsg_con; } +#ifdef CONFIG_VT_CKO +static unsigned int printk_color[8] __read_mostly = { + CONFIG_VT_PRINTK_EMERG_COLOR, /* KERN_EMERG */ + CONFIG_VT_PRINTK_ALERT_COLOR, /* KERN_ALERT */ + CONFIG_VT_PRINTK_CRIT_COLOR, /* KERN_CRIT */ + CONFIG_VT_PRINTK_ERR_COLOR, /* KERN_ERR */ + CONFIG_VT_PRINTK_WARNING_COLOR, /* KERN_WARNING */ + CONFIG_VT_PRINTK_NOTICE_COLOR, /* KERN_NOTICE */ + CONFIG_VT_PRINTK_INFO_COLOR, /* KERN_INFO */ + CONFIG_VT_PRINTK_DEBUG_COLOR, /* KERN_DEBUG */ +}; +module_param_array(printk_color, uint, NULL, S_IRUGO | S_IWUSR); + +static inline void vc_set_color(struct vc_data *vc, unsigned char color) +{ + vc->vc_color = color_table[color & 0xF] | + (color_table[(color >> 4) & 0x7] << 4) | + (color & 0x80); + update_attr(vc); +} +#else +static unsigned int printk_color[8]; +static inline void vc_set_color(const struct vc_data *vc, unsigned char c) +{ +} +#endif + /* * Console on virtual terminal * * The console must be locked when we get here. */ -static void vt_console_print(struct console *co, const char *b, unsigned count) +static void vt_console_print(struct console *co, const char *b, unsigned count, + unsigned int loglevel) { struct vc_data *vc = vc_cons[fg_console].d; - unsigned char c; + unsigned char current_color, c; static DEFINE_SPINLOCK(printing_lock); const ushort *start; ushort cnt = 0; @@ -2576,11 +2605,20 @@ static void vt_console_print(struct console *co, const char *b, unsigned count) start = (ushort *)vc->vc_pos; + /* + * We always get a valid loglevel - <8> and "no level" is transformed + * to <4> in the typical kernel. + */ + current_color = printk_color[loglevel]; + vc_set_color(vc, current_color); + + /* Contrived structure to try to emulate original need_wrap behaviour * Problems caused when we have need_wrap set on '\n' character */ while (count--) { c = *b++; if (c == 10 || c == 13 || c == 8 || vc->vc_need_wrap) { + vc_set_color(vc, vc->vc_def_color); if (cnt > 0) { if (CON_IS_VISIBLE(vc)) vc->vc_sw->con_putcs(vc, start, cnt, vc->vc_y, vc->vc_x); @@ -2593,6 +2631,7 @@ static void vt_console_print(struct console *co, const char *b, unsigned count) bs(vc); start = (ushort *)vc->vc_pos; myx = vc->vc_x; + vc_set_color(vc, current_color); continue; } if (c != 13) @@ -2600,6 +2639,7 @@ static void vt_console_print(struct console *co, const char *b, unsigned count) cr(vc); start = (ushort *)vc->vc_pos; myx = vc->vc_x; + vc_set_color(vc, current_color); if (c == 10 || c == 13) continue; } @@ -2622,6 +2662,7 @@ static void vt_console_print(struct console *co, const char *b, unsigned count) vc->vc_need_wrap = 1; } } + vc_set_color(vc, vc->vc_def_color); set_cursor(vc); notify_update(vc); diff --git a/drivers/usb/gadget/function/u_serial.c b/drivers/usb/gadget/function/u_serial.c index 6af145f..5a76e4a 100644 --- a/drivers/usb/gadget/function/u_serial.c +++ b/drivers/usb/gadget/function/u_serial.c @@ -1514,8 +1514,13 @@ void gserial_disconnect(struct gserial *gser) gser->ioport = NULL; if (port->port.count > 0 || port->openclose) { wake_up_interruptible(&port->drain_wait); +#if 0 if (port->port.tty) tty_hangup(port->port.tty); +#else + if (port->port.tty) + stop_tty(port->port.tty); +#endif } spin_unlock_irqrestore(&port->port_lock, flags); diff --git a/fs/Kconfig b/fs/Kconfig index 9adee0d..dc04844 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -112,6 +112,7 @@ if BLOCK menu "DOS/FAT/NT Filesystems" source "fs/fat/Kconfig" +source "fs/exfat/Kconfig" source "fs/ntfs/Kconfig" endmenu diff --git a/fs/Makefile b/fs/Makefile index 79f5225..b8da1c2 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -76,6 +76,7 @@ obj-$(CONFIG_HUGETLBFS) += hugetlbfs/ obj-$(CONFIG_CODA_FS) += coda/ obj-$(CONFIG_MINIX_FS) += minix/ obj-$(CONFIG_FAT_FS) += fat/ +obj-$(CONFIG_EXFAT_FS) += exfat/ obj-$(CONFIG_BFS_FS) += bfs/ obj-$(CONFIG_ISO9660_FS) += isofs/ obj-$(CONFIG_HFSPLUS_FS) += hfsplus/ # Before hfs to find wrapped HFS+ diff --git a/fs/drop_caches.c b/fs/drop_caches.c index d72d52b..4f591f1 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -8,6 +8,7 @@ #include #include #include +#include #include "internal.h" /* A global variable is a bit ugly, but it keeps the code simple */ @@ -39,6 +40,12 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused) iput(toput_inode); } +/* For TuxOnIce */ +void drop_pagecache(void) +{ + iterate_supers(drop_pagecache_sb, NULL); +} + int drop_caches_sysctl_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { diff --git a/fs/exec.c b/fs/exec.c index dcd4ac7..ab176c0 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -57,6 +57,8 @@ #include #include +#include + #include #include #include @@ -790,8 +792,10 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags) if (err) goto exit; - if (name->name[0] != '\0') + if (name->name[0] != '\0') { fsnotify_open(file); + trace_open_exec(name->name); + } out: return file; diff --git b/fs/exfat/Kconfig b/fs/exfat/Kconfig new file mode 100644 index 0000000..78b32aa --- /dev/null +++ b/fs/exfat/Kconfig @@ -0,0 +1,39 @@ +config EXFAT_FS + tristate "exFAT fs support" + select NLS + help + This adds support for the exFAT file system. + +config EXFAT_DISCARD + bool "enable discard support" + depends on EXFAT_FS + default y + +config EXFAT_DELAYED_SYNC + bool "enable delayed sync" + depends on EXFAT_FS + default n + +config EXFAT_KERNEL_DEBUG + bool "enable kernel debug features via ioctl" + depends on EXFAT_FS + default n + +config EXFAT_DEBUG_MSG + bool "print debug messages" + depends on EXFAT_FS + default n + +config EXFAT_DEFAULT_CODEPAGE + int "Default codepage for exFAT" + default 437 + depends on EXFAT_FS + help + This option should be set to the codepage of your exFAT filesystems. + +config EXFAT_DEFAULT_IOCHARSET + string "Default iocharset for exFAT" + default "utf8" + depends on EXFAT_FS + help + Set this to the default input/output character set you'd like exFAT to use. diff --git b/fs/exfat/LICENSE b/fs/exfat/LICENSE new file mode 100644 index 0000000..d159169 --- /dev/null +++ b/fs/exfat/LICENSE @@ -0,0 +1,339 @@ + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License along + with this program; if not, write to the Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) year name of author + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. + + , 1 April 1989 + Ty Coon, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Lesser General +Public License instead of this License. diff --git b/fs/exfat/Makefile b/fs/exfat/Makefile new file mode 100644 index 0000000..fb2be9d --- /dev/null +++ b/fs/exfat/Makefile @@ -0,0 +1,55 @@ +# +# Makefile for Linux FAT12/FAT16/FAT32(VFAT)/FAT64(ExFAT) filesystem driver. +# + +ifneq ($(KERNELRELEASE),) +# call from kernel build system + +obj-$(CONFIG_EXFAT_FS) += exfat.o + +exfat-objs := exfat_core.o exfat_super.o exfat_api.o exfat_blkdev.o exfat_cache.o \ + exfat_data.o exfat_bitmap.o exfat_nls.o exfat_oal.o exfat_upcase.o + +else +# external module build + +EXTRA_FLAGS += -I$(PWD) + +# +# KDIR is a path to a directory containing kernel source. +# It can be specified on the command line passed to make to enable the module to +# be built and installed for a kernel other than the one currently running. +# By default it is the path to the symbolic link created when +# the current kernel's modules were installed, but +# any valid path to the directory in which the target kernel's source is located +# can be provided on the command line. +# +KDIR := /lib/modules/$(shell uname -r)/build +MDIR := /lib/modules/$(shell uname -r) +PWD := $(shell pwd) +KREL := $(shell cd ${KDIR} && make -s kernelrelease) +PWD := $(shell pwd) + +export CONFIG_EXFAT_FS := m + +all: + $(MAKE) -C $(KDIR) M=$(PWD) modules + +clean: + $(MAKE) -C $(KDIR) M=$(PWD) clean + +help: + $(MAKE) -C $(KDIR) M=$(PWD) help + +install: exfat.ko + rm -f ${MDIR}/kernel/fs/exfat/exfat.ko + install -m644 -b -D exfat.ko ${MDIR}/kernel/fs/exfat/exfat.ko + depmod -aq + +uninstall: + rm -rf ${MDIR}/kernel/fs/exfat + depmod -aq + +endif + +.PHONY : all clean install uninstall diff --git b/fs/exfat/README.md b/fs/exfat/README.md new file mode 100644 index 0000000..3e1887f --- /dev/null +++ b/fs/exfat/README.md @@ -0,0 +1,74 @@ +exfat-nofuse +============ + +Linux non-fuse read/write kernel driver for the exFAT, FAT12, FAT16 and vfat (FAT32) file systems.
+Originally ported from Android kernel v3.0. + +Kudos to ksv1986 for the mutex patch!
+Thanks to JackNorris for being awesome and providing the clear_inode() patch.
+
+Big thanks to lqs for completing the driver!
+Big thanks to benpicco for fixing 3.11.y compatibility! + + +Special thanks to github user AndreiLux for spreading the word about the leak!
+ + +Installing as a stand-alone module: +==================================== + + make + sudo make install + +To load the driver manually, run this as root: + + modprobe exfat + +You may also specify custom toolchains by using CROSS_COMPILE flag, in my case: +>CROSS_COMPILE=../dorimanx-SG2-I9100-Kernel/android-toolchain/bin/arm-eabi- + +Installing as a part of the kernel: +====================================== + +Let's take [linux] as the path to your kernel source dir... + + cd [linux] + cp -rvf exfat-nofuse [linux]/fs/exfat + +edit [linux]/fs/Kconfig +``` + menu "DOS/FAT/NT Filesystems" + + source "fs/fat/Kconfig" + +source "fs/exfat/Kconfig" + source "fs/ntfs/Kconfig" + endmenu +``` + + +edit [linux]/fs/Makefile +``` + obj-$(CONFIG_FAT_FS) += fat/ + +obj-$(CONFIG_EXFAT_FS) += exfat/ + obj-$(CONFIG_BFS_FS) += bfs/ +``` + + cd [linux] + make menuconfig + +Go to: +> File systems > DOS/FAT/NT +> check exfat as MODULE (M) +> (437) Default codepage for exFAT +> (utf8) Default iocharset for exFAT + +> ESC to main menu +> Save an Alternate Configuration File +> ESC ESC + +build your kernel + +Have fun. + +Free Software for the Free Minds! +================================= diff --git b/fs/exfat/exfat_api.c b/fs/exfat/exfat_api.c new file mode 100644 index 0000000..32b29f0 --- /dev/null +++ b/fs/exfat/exfat_api.c @@ -0,0 +1,528 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_api.c */ +/* PURPOSE : exFAT API Glue Layer */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#include +#include +#include + +#include "exfat_version.h" +#include "exfat_config.h" +#include "exfat_data.h" +#include "exfat_oal.h" + +#include "exfat_nls.h" +#include "exfat_api.h" +#include "exfat_super.h" +#include "exfat_core.h" + +/*----------------------------------------------------------------------*/ +/* Constant & Macro Definitions */ +/*----------------------------------------------------------------------*/ + +/*----------------------------------------------------------------------*/ +/* Global Variable Definitions */ +/*----------------------------------------------------------------------*/ + +extern struct semaphore z_sem; + +/*----------------------------------------------------------------------*/ +/* Local Variable Definitions */ +/*----------------------------------------------------------------------*/ + +/*----------------------------------------------------------------------*/ +/* Local Function Declarations */ +/*----------------------------------------------------------------------*/ + +/*======================================================================*/ +/* Global Function Definitions */ +/* - All functions for global use have same return value format, */ +/* that is, FFS_SUCCESS on success and several FS error code on */ +/* various error condition. */ +/*======================================================================*/ + +/*----------------------------------------------------------------------*/ +/* exFAT Filesystem Init & Exit Functions */ +/*----------------------------------------------------------------------*/ + +int FsInit(void) +{ + return ffsInit(); +} + +int FsShutdown(void) +{ + return ffsShutdown(); +} + +/*----------------------------------------------------------------------*/ +/* Volume Management Functions */ +/*----------------------------------------------------------------------*/ + +/* FsMountVol : mount the file system volume */ +int FsMountVol(struct super_block *sb) +{ + int err; + + sm_P(&z_sem); + + err = buf_init(sb); + if (!err) + err = ffsMountVol(sb); + else + buf_shutdown(sb); + + sm_V(&z_sem); + + return err; +} /* end of FsMountVol */ + +/* FsUmountVol : unmount the file system volume */ +int FsUmountVol(struct super_block *sb) +{ + int err; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + sm_P(&z_sem); + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsUmountVol(sb); + buf_shutdown(sb); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + sm_V(&z_sem); + + return err; +} /* end of FsUmountVol */ + +/* FsGetVolInfo : get the information of a file system volume */ +int FsGetVolInfo(struct super_block *sb, VOL_INFO_T *info) +{ + int err; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* check the validity of pointer parameters */ + if (info == NULL) + return FFS_ERROR; + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsGetVolInfo(sb, info); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsGetVolInfo */ + +/* FsSyncVol : synchronize a file system volume */ +int FsSyncVol(struct super_block *sb, int do_sync) +{ + int err; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsSyncVol(sb, do_sync); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsSyncVol */ + + +/*----------------------------------------------------------------------*/ +/* File Operation Functions */ +/*----------------------------------------------------------------------*/ + +/* FsCreateFile : create a file */ +int FsLookupFile(struct inode *inode, char *path, FILE_ID_T *fid) +{ + int err; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* check the validity of pointer parameters */ + if ((fid == NULL) || (path == NULL) || (*path == '\0')) + return FFS_ERROR; + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsLookupFile(inode, path, fid); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsLookupFile */ + +/* FsCreateFile : create a file */ +int FsCreateFile(struct inode *inode, char *path, u8 mode, FILE_ID_T *fid) +{ + int err; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* check the validity of pointer parameters */ + if ((fid == NULL) || (path == NULL) || (*path == '\0')) + return FFS_ERROR; + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsCreateFile(inode, path, mode, fid); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsCreateFile */ + +int FsReadFile(struct inode *inode, FILE_ID_T *fid, void *buffer, u64 count, u64 *rcount) +{ + int err; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* check the validity of the given file id */ + if (fid == NULL) + return FFS_INVALIDFID; + + /* check the validity of pointer parameters */ + if (buffer == NULL) + return FFS_ERROR; + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsReadFile(inode, fid, buffer, count, rcount); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsReadFile */ + +int FsWriteFile(struct inode *inode, FILE_ID_T *fid, void *buffer, u64 count, u64 *wcount) +{ + int err; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* check the validity of the given file id */ + if (fid == NULL) + return FFS_INVALIDFID; + + /* check the validity of pointer parameters */ + if (buffer == NULL) + return FFS_ERROR; + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsWriteFile(inode, fid, buffer, count, wcount); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsWriteFile */ + +/* FsTruncateFile : resize the file length */ +int FsTruncateFile(struct inode *inode, u64 old_size, u64 new_size) +{ + int err; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + DPRINTK("FsTruncateFile entered (inode %p size %llu)\n", inode, new_size); + + err = ffsTruncateFile(inode, old_size, new_size); + + DPRINTK("FsTruncateFile exitted (%d)\n", err); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsTruncateFile */ + +/* FsMoveFile : move(rename) a old file into a new file */ +int FsMoveFile(struct inode *old_parent_inode, FILE_ID_T *fid, struct inode *new_parent_inode, struct dentry *new_dentry) +{ + int err; + struct super_block *sb = old_parent_inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* check the validity of the given file id */ + if (fid == NULL) + return FFS_INVALIDFID; + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsMoveFile(old_parent_inode, fid, new_parent_inode, new_dentry); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsMoveFile */ + +/* FsRemoveFile : remove a file */ +int FsRemoveFile(struct inode *inode, FILE_ID_T *fid) +{ + int err; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* check the validity of the given file id */ + if (fid == NULL) + return FFS_INVALIDFID; + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsRemoveFile(inode, fid); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsRemoveFile */ + +/* FsSetAttr : set the attribute of a given file */ +int FsSetAttr(struct inode *inode, u32 attr) +{ + int err; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsSetAttr(inode, attr); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsSetAttr */ + +/* FsReadStat : get the information of a given file */ +int FsReadStat(struct inode *inode, DIR_ENTRY_T *info) +{ + int err; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsGetStat(inode, info); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsReadStat */ + +/* FsWriteStat : set the information of a given file */ +int FsWriteStat(struct inode *inode, DIR_ENTRY_T *info) +{ + int err; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + DPRINTK("FsWriteStat entered (inode %p info %p\n", inode, info); + + err = ffsSetStat(inode, info); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + DPRINTK("FsWriteStat exited (%d)\n", err); + + return err; +} /* end of FsWriteStat */ + +/* FsMapCluster : return the cluster number in the given cluster offset */ +int FsMapCluster(struct inode *inode, s32 clu_offset, u32 *clu) +{ + int err; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* check the validity of pointer parameters */ + if (clu == NULL) + return FFS_ERROR; + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsMapCluster(inode, clu_offset, clu); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsMapCluster */ + +/*----------------------------------------------------------------------*/ +/* Directory Operation Functions */ +/*----------------------------------------------------------------------*/ + +/* FsCreateDir : create(make) a directory */ +int FsCreateDir(struct inode *inode, char *path, FILE_ID_T *fid) +{ + int err; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* check the validity of pointer parameters */ + if ((fid == NULL) || (path == NULL) || (*path == '\0')) + return FFS_ERROR; + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsCreateDir(inode, path, fid); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsCreateDir */ + +/* FsReadDir : read a directory entry from the opened directory */ +int FsReadDir(struct inode *inode, DIR_ENTRY_T *dir_entry) +{ + int err; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* check the validity of pointer parameters */ + if (dir_entry == NULL) + return FFS_ERROR; + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsReadDir(inode, dir_entry); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsReadDir */ + +/* FsRemoveDir : remove a directory */ +int FsRemoveDir(struct inode *inode, FILE_ID_T *fid) +{ + int err; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* check the validity of the given file id */ + if (fid == NULL) + return FFS_INVALIDFID; + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + err = ffsRemoveDir(inode, fid); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return err; +} /* end of FsRemoveDir */ + +EXPORT_SYMBOL(FsMountVol); +EXPORT_SYMBOL(FsUmountVol); +EXPORT_SYMBOL(FsGetVolInfo); +EXPORT_SYMBOL(FsSyncVol); +EXPORT_SYMBOL(FsLookupFile); +EXPORT_SYMBOL(FsCreateFile); +EXPORT_SYMBOL(FsReadFile); +EXPORT_SYMBOL(FsWriteFile); +EXPORT_SYMBOL(FsTruncateFile); +EXPORT_SYMBOL(FsMoveFile); +EXPORT_SYMBOL(FsRemoveFile); +EXPORT_SYMBOL(FsSetAttr); +EXPORT_SYMBOL(FsReadStat); +EXPORT_SYMBOL(FsWriteStat); +EXPORT_SYMBOL(FsMapCluster); +EXPORT_SYMBOL(FsCreateDir); +EXPORT_SYMBOL(FsReadDir); +EXPORT_SYMBOL(FsRemoveDir); + +#ifdef CONFIG_EXFAT_KERNEL_DEBUG +/* FsReleaseCache: Release FAT & buf cache */ +int FsReleaseCache(struct super_block *sb) +{ + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* acquire the lock for file system critical section */ + sm_P(&p_fs->v_sem); + + FAT_release_all(sb); + buf_release_all(sb); + + /* release the lock for file system critical section */ + sm_V(&p_fs->v_sem); + + return 0; +} +/* FsReleaseCache */ + +EXPORT_SYMBOL(FsReleaseCache); +#endif /* CONFIG_EXFAT_KERNEL_DEBUG */ + +/*======================================================================*/ +/* Local Function Definitions */ +/*======================================================================*/ diff --git b/fs/exfat/exfat_api.h b/fs/exfat/exfat_api.h new file mode 100644 index 0000000..84bdf61 --- /dev/null +++ b/fs/exfat/exfat_api.h @@ -0,0 +1,206 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_api.h */ +/* PURPOSE : Header File for exFAT API Glue Layer */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#ifndef _EXFAT_API_H +#define _EXFAT_API_H + +#include +#include "exfat_config.h" + +/*----------------------------------------------------------------------*/ +/* Constant & Macro Definitions */ +/*----------------------------------------------------------------------*/ + +#define EXFAT_SUPER_MAGIC (0x2011BAB0L) +#define EXFAT_ROOT_INO 1 + +/* FAT types */ +#define FAT12 0x01 /* FAT12 */ +#define FAT16 0x0E /* Win95 FAT16 (LBA) */ +#define FAT32 0x0C /* Win95 FAT32 (LBA) */ +#define EXFAT 0x07 /* exFAT */ + +/* file name lengths */ +#define MAX_CHARSET_SIZE 3 /* max size of multi-byte character */ +#define MAX_PATH_DEPTH 15 /* max depth of path name */ +#define MAX_NAME_LENGTH 256 /* max len of file name including NULL */ +#define MAX_PATH_LENGTH 260 /* max len of path name including NULL */ +#define DOS_NAME_LENGTH 11 /* DOS file name length excluding NULL */ +#define DOS_PATH_LENGTH 80 /* DOS path name length excluding NULL */ + +/* file attributes */ +#define ATTR_NORMAL 0x0000 +#define ATTR_READONLY 0x0001 +#define ATTR_HIDDEN 0x0002 +#define ATTR_SYSTEM 0x0004 +#define ATTR_VOLUME 0x0008 +#define ATTR_SUBDIR 0x0010 +#define ATTR_ARCHIVE 0x0020 +#define ATTR_SYMLINK 0x0040 +#define ATTR_EXTEND 0x000F +#define ATTR_RWMASK 0x007E + +/* file creation modes */ +#define FM_REGULAR 0x00 +#define FM_SYMLINK 0x40 + +/* return values */ +#define FFS_SUCCESS 0 +#define FFS_MEDIAERR 1 +#define FFS_FORMATERR 2 +#define FFS_MOUNTED 3 +#define FFS_NOTMOUNTED 4 +#define FFS_ALIGNMENTERR 5 +#define FFS_SEMAPHOREERR 6 +#define FFS_INVALIDPATH 7 +#define FFS_INVALIDFID 8 +#define FFS_NOTFOUND 9 +#define FFS_FILEEXIST 10 +#define FFS_PERMISSIONERR 11 +#define FFS_NOTOPENED 12 +#define FFS_MAXOPENED 13 +#define FFS_FULL 14 +#define FFS_EOF 15 +#define FFS_DIRBUSY 16 +#define FFS_MEMORYERR 17 +#define FFS_NAMETOOLONG 18 +#define FFS_ERROR 19 + +/*----------------------------------------------------------------------*/ +/* Type Definitions */ +/*----------------------------------------------------------------------*/ + +typedef struct { + u16 Year; + u16 Month; + u16 Day; + u16 Hour; + u16 Minute; + u16 Second; + u16 MilliSecond; +} DATE_TIME_T; + +typedef struct { + u32 Offset; /* start sector number of the partition */ + u32 Size; /* in sectors */ +} PART_INFO_T; + +typedef struct { + u32 SecSize; /* sector size in bytes */ + u32 DevSize; /* block device size in sectors */ +} DEV_INFO_T; + +typedef struct { + u32 FatType; + u32 ClusterSize; + u32 NumClusters; + u32 FreeClusters; + u32 UsedClusters; +} VOL_INFO_T; + +/* directory structure */ +typedef struct { + u32 dir; + s32 size; + u8 flags; +} CHAIN_T; + +/* file id structure */ +typedef struct { + CHAIN_T dir; + s32 entry; + u32 type; + u32 attr; + u32 start_clu; + u64 size; + u8 flags; + s64 rwoffset; + s32 hint_last_off; + u32 hint_last_clu; +} FILE_ID_T; + +typedef struct { + char Name[MAX_NAME_LENGTH * MAX_CHARSET_SIZE]; + char ShortName[DOS_NAME_LENGTH + 2]; /* used only for FAT12/16/32, not used for exFAT */ + u32 Attr; + u64 Size; + u32 NumSubdirs; + DATE_TIME_T CreateTimestamp; + DATE_TIME_T ModifyTimestamp; + DATE_TIME_T AccessTimestamp; +} DIR_ENTRY_T; + +/*======================================================================*/ +/* */ +/* API FUNCTION DECLARATIONS */ +/* (CHANGE THIS PART IF REQUIRED) */ +/* */ +/*======================================================================*/ + +/*----------------------------------------------------------------------*/ +/* External Function Declarations */ +/*----------------------------------------------------------------------*/ + +/* file system initialization & shutdown functions */ + int FsInit(void); + int FsShutdown(void); + +/* volume management functions */ + int FsMountVol(struct super_block *sb); + int FsUmountVol(struct super_block *sb); + int FsGetVolInfo(struct super_block *sb, VOL_INFO_T *info); + int FsSyncVol(struct super_block *sb, int do_sync); + +/* file management functions */ + int FsLookupFile(struct inode *inode, char *path, FILE_ID_T *fid); + int FsCreateFile(struct inode *inode, char *path, u8 mode, FILE_ID_T *fid); + int FsReadFile(struct inode *inode, FILE_ID_T *fid, void *buffer, u64 count, u64 *rcount); + int FsWriteFile(struct inode *inode, FILE_ID_T *fid, void *buffer, u64 count, u64 *wcount); + int FsTruncateFile(struct inode *inode, u64 old_size, u64 new_size); + int FsMoveFile(struct inode *old_parent_inode, FILE_ID_T *fid, struct inode *new_parent_inode, struct dentry *new_dentry); + int FsRemoveFile(struct inode *inode, FILE_ID_T *fid); + int FsSetAttr(struct inode *inode, u32 attr); + int FsReadStat(struct inode *inode, DIR_ENTRY_T *info); + int FsWriteStat(struct inode *inode, DIR_ENTRY_T *info); + int FsMapCluster(struct inode *inode, s32 clu_offset, u32 *clu); + +/* directory management functions */ + int FsCreateDir(struct inode *inode, char *path, FILE_ID_T *fid); + int FsReadDir(struct inode *inode, DIR_ENTRY_T *dir_entry); + int FsRemoveDir(struct inode *inode, FILE_ID_T *fid); + +/* debug functions */ +s32 FsReleaseCache(struct super_block *sb); + +#endif /* _EXFAT_API_H */ diff --git b/fs/exfat/exfat_bitmap.c b/fs/exfat/exfat_bitmap.c new file mode 100644 index 0000000..b0672dd --- /dev/null +++ b/fs/exfat/exfat_bitmap.c @@ -0,0 +1,63 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_global.c */ +/* PURPOSE : exFAT Miscellaneous Functions */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#include "exfat_config.h" +#include "exfat_bitmap.h" + +/*----------------------------------------------------------------------*/ +/* Bitmap Manipulation Functions */ +/*----------------------------------------------------------------------*/ + +#define BITMAP_LOC(v) ((v) >> 3) +#define BITMAP_SHIFT(v) ((v) & 0x07) + +s32 exfat_bitmap_test(u8 *bitmap, int i) +{ + u8 data; + + data = bitmap[BITMAP_LOC(i)]; + if ((data >> BITMAP_SHIFT(i)) & 0x01) + return 1; + return 0; +} /* end of Bitmap_test */ + +void exfat_bitmap_set(u8 *bitmap, int i) +{ + bitmap[BITMAP_LOC(i)] |= (0x01 << BITMAP_SHIFT(i)); +} /* end of Bitmap_set */ + +void exfat_bitmap_clear(u8 *bitmap, int i) +{ + bitmap[BITMAP_LOC(i)] &= ~(0x01 << BITMAP_SHIFT(i)); +} /* end of Bitmap_clear */ diff --git b/fs/exfat/exfat_bitmap.h b/fs/exfat/exfat_bitmap.h new file mode 100644 index 0000000..4f482c7 --- /dev/null +++ b/fs/exfat/exfat_bitmap.h @@ -0,0 +1,55 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_global.h */ +/* PURPOSE : Header File for exFAT Global Definitions & Misc Functions */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#ifndef _EXFAT_BITMAP_H +#define _EXFAT_BITMAP_H + +#include + +/*======================================================================*/ +/* */ +/* LIBRARY FUNCTION DECLARATIONS -- OTHER UTILITY FUNCTIONS */ +/* (DO NOT CHANGE THIS PART !!) */ +/* */ +/*======================================================================*/ + +/*----------------------------------------------------------------------*/ +/* Bitmap Manipulation Functions */ +/*----------------------------------------------------------------------*/ + +s32 exfat_bitmap_test(u8 *bitmap, int i); +void exfat_bitmap_set(u8 *bitmap, int i); +void exfat_bitmap_clear(u8 *bitmpa, int i); + +#endif /* _EXFAT_BITMAP_H */ diff --git b/fs/exfat/exfat_blkdev.c b/fs/exfat/exfat_blkdev.c new file mode 100644 index 0000000..02fa4fe --- /dev/null +++ b/fs/exfat/exfat_blkdev.c @@ -0,0 +1,197 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_blkdev.c */ +/* PURPOSE : exFAT Block Device Driver Glue Layer */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#include +#include +#include "exfat_config.h" +#include "exfat_blkdev.h" +#include "exfat_data.h" +#include "exfat_api.h" +#include "exfat_super.h" + +/*----------------------------------------------------------------------*/ +/* Constant & Macro Definitions */ +/*----------------------------------------------------------------------*/ + +/*----------------------------------------------------------------------*/ +/* Global Variable Definitions */ +/*----------------------------------------------------------------------*/ + +/*----------------------------------------------------------------------*/ +/* Local Variable Definitions */ +/*----------------------------------------------------------------------*/ + +/*======================================================================*/ +/* Function Definitions */ +/*======================================================================*/ + +s32 bdev_init(void) +{ + return FFS_SUCCESS; +} + +s32 bdev_shutdown(void) +{ + return FFS_SUCCESS; +} + +s32 bdev_open(struct super_block *sb) +{ + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + if (p_bd->opened) + return FFS_SUCCESS; + + p_bd->sector_size = bdev_logical_block_size(sb->s_bdev); + p_bd->sector_size_bits = ilog2(p_bd->sector_size); + p_bd->sector_size_mask = p_bd->sector_size - 1; + p_bd->num_sectors = i_size_read(sb->s_bdev->bd_inode) >> p_bd->sector_size_bits; + + p_bd->opened = TRUE; + + return FFS_SUCCESS; +} + +s32 bdev_close(struct super_block *sb) +{ + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + if (!p_bd->opened) + return FFS_SUCCESS; + + p_bd->opened = FALSE; + return FFS_SUCCESS; +} + +s32 bdev_read(struct super_block *sb, u32 secno, struct buffer_head **bh, u32 num_secs, s32 read) +{ + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); +#ifdef CONFIG_EXFAT_KERNEL_DEBUG + struct exfat_sb_info *sbi = EXFAT_SB(sb); + long flags = sbi->debug_flags; + + if (flags & EXFAT_DEBUGFLAGS_ERROR_RW) + return FFS_MEDIAERR; +#endif /* CONFIG_EXFAT_KERNEL_DEBUG */ + + if (!p_bd->opened) + return FFS_MEDIAERR; + + if (*bh) + __brelse(*bh); + + if (read) + *bh = __bread(sb->s_bdev, secno, num_secs << p_bd->sector_size_bits); + else + *bh = __getblk(sb->s_bdev, secno, num_secs << p_bd->sector_size_bits); + + if (*bh) + return FFS_SUCCESS; + + WARN(!p_fs->dev_ejected, + "[EXFAT] No bh, device seems wrong or to be ejected.\n"); + + return FFS_MEDIAERR; +} + +s32 bdev_write(struct super_block *sb, u32 secno, struct buffer_head *bh, u32 num_secs, s32 sync) +{ + s32 count; + struct buffer_head *bh2; + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); +#ifdef CONFIG_EXFAT_KERNEL_DEBUG + struct exfat_sb_info *sbi = EXFAT_SB(sb); + long flags = sbi->debug_flags; + + if (flags & EXFAT_DEBUGFLAGS_ERROR_RW) + return FFS_MEDIAERR; +#endif /* CONFIG_EXFAT_KERNEL_DEBUG */ + + if (!p_bd->opened) + return FFS_MEDIAERR; + + if (secno == bh->b_blocknr) { + lock_buffer(bh); + set_buffer_uptodate(bh); + mark_buffer_dirty(bh); + unlock_buffer(bh); + if (sync && (sync_dirty_buffer(bh) != 0)) + return FFS_MEDIAERR; + } else { + count = num_secs << p_bd->sector_size_bits; + + bh2 = __getblk(sb->s_bdev, secno, count); + + if (bh2 == NULL) + goto no_bh; + + lock_buffer(bh2); + memcpy(bh2->b_data, bh->b_data, count); + set_buffer_uptodate(bh2); + mark_buffer_dirty(bh2); + unlock_buffer(bh2); + if (sync && (sync_dirty_buffer(bh2) != 0)) { + __brelse(bh2); + goto no_bh; + } + __brelse(bh2); + } + + return FFS_SUCCESS; + +no_bh: + WARN(!p_fs->dev_ejected, + "[EXFAT] No bh, device seems wrong or to be ejected.\n"); + + return FFS_MEDIAERR; +} + +s32 bdev_sync(struct super_block *sb) +{ + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); +#ifdef CONFIG_EXFAT_KERNEL_DEBUG + struct exfat_sb_info *sbi = EXFAT_SB(sb); + long flags = sbi->debug_flags; + + if (flags & EXFAT_DEBUGFLAGS_ERROR_RW) + return FFS_MEDIAERR; +#endif /* CONFIG_EXFAT_KERNEL_DEBUG */ + + if (!p_bd->opened) + return FFS_MEDIAERR; + + return sync_blockdev(sb->s_bdev); +} diff --git b/fs/exfat/exfat_blkdev.h b/fs/exfat/exfat_blkdev.h new file mode 100644 index 0000000..169d25d --- /dev/null +++ b/fs/exfat/exfat_blkdev.h @@ -0,0 +1,73 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_blkdev.h */ +/* PURPOSE : Header File for exFAT Block Device Driver Glue Layer */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#ifndef _EXFAT_BLKDEV_H +#define _EXFAT_BLKDEV_H + +#include +#include "exfat_config.h" + +/*----------------------------------------------------------------------*/ +/* Constant & Macro Definitions (Non-Configurable) */ +/*----------------------------------------------------------------------*/ + +/*----------------------------------------------------------------------*/ +/* Type Definitions */ +/*----------------------------------------------------------------------*/ + +typedef struct __BD_INFO_T { + s32 sector_size; /* in bytes */ + s32 sector_size_bits; + s32 sector_size_mask; + s32 num_sectors; /* total number of sectors in this block device */ + bool opened; /* opened or not */ +} BD_INFO_T; + +/*----------------------------------------------------------------------*/ +/* External Variable Declarations */ +/*----------------------------------------------------------------------*/ + +/*----------------------------------------------------------------------*/ +/* External Function Declarations */ +/*----------------------------------------------------------------------*/ + +s32 bdev_init(void); +s32 bdev_shutdown(void); +s32 bdev_open(struct super_block *sb); +s32 bdev_close(struct super_block *sb); +s32 bdev_read(struct super_block *sb, u32 secno, struct buffer_head **bh, u32 num_secs, s32 read); +s32 bdev_write(struct super_block *sb, u32 secno, struct buffer_head *bh, u32 num_secs, s32 sync); +s32 bdev_sync(struct super_block *sb); + +#endif /* _EXFAT_BLKDEV_H */ diff --git b/fs/exfat/exfat_cache.c b/fs/exfat/exfat_cache.c new file mode 100644 index 0000000..e6ca88b --- /dev/null +++ b/fs/exfat/exfat_cache.c @@ -0,0 +1,780 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_cache.c */ +/* PURPOSE : exFAT Cache Manager */ +/* (FAT Cache & Buffer Cache) */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Sung-Kwan Kim] : first writing */ +/* */ +/************************************************************************/ + +#include "exfat_config.h" +#include "exfat_data.h" + +#include "exfat_cache.h" +#include "exfat_super.h" +#include "exfat_core.h" + +/*----------------------------------------------------------------------*/ +/* Global Variable Definitions */ +/*----------------------------------------------------------------------*/ + +#define sm_P(s) +#define sm_V(s) + +static s32 __FAT_read(struct super_block *sb, u32 loc, u32 *content); +static s32 __FAT_write(struct super_block *sb, u32 loc, u32 content); + +static BUF_CACHE_T *FAT_cache_find(struct super_block *sb, u32 sec); +static BUF_CACHE_T *FAT_cache_get(struct super_block *sb, u32 sec); +static void FAT_cache_insert_hash(struct super_block *sb, BUF_CACHE_T *bp); +static void FAT_cache_remove_hash(BUF_CACHE_T *bp); + +static u8 *__buf_getblk(struct super_block *sb, u32 sec); + +static BUF_CACHE_T *buf_cache_find(struct super_block *sb, u32 sec); +static BUF_CACHE_T *buf_cache_get(struct super_block *sb, u32 sec); +static void buf_cache_insert_hash(struct super_block *sb, BUF_CACHE_T *bp); +static void buf_cache_remove_hash(BUF_CACHE_T *bp); + +static void push_to_mru(BUF_CACHE_T *bp, BUF_CACHE_T *list); +static void push_to_lru(BUF_CACHE_T *bp, BUF_CACHE_T *list); +static void move_to_mru(BUF_CACHE_T *bp, BUF_CACHE_T *list); +static void move_to_lru(BUF_CACHE_T *bp, BUF_CACHE_T *list); + +/*======================================================================*/ +/* Cache Initialization Functions */ +/*======================================================================*/ + +s32 buf_init(struct super_block *sb) +{ + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + int i; + + /* LRU list */ + p_fs->FAT_cache_lru_list.next = p_fs->FAT_cache_lru_list.prev = &p_fs->FAT_cache_lru_list; + + for (i = 0; i < FAT_CACHE_SIZE; i++) { + p_fs->FAT_cache_array[i].drv = -1; + p_fs->FAT_cache_array[i].sec = ~0; + p_fs->FAT_cache_array[i].flag = 0; + p_fs->FAT_cache_array[i].buf_bh = NULL; + p_fs->FAT_cache_array[i].prev = p_fs->FAT_cache_array[i].next = NULL; + push_to_mru(&(p_fs->FAT_cache_array[i]), &p_fs->FAT_cache_lru_list); + } + + p_fs->buf_cache_lru_list.next = p_fs->buf_cache_lru_list.prev = &p_fs->buf_cache_lru_list; + + for (i = 0; i < BUF_CACHE_SIZE; i++) { + p_fs->buf_cache_array[i].drv = -1; + p_fs->buf_cache_array[i].sec = ~0; + p_fs->buf_cache_array[i].flag = 0; + p_fs->buf_cache_array[i].buf_bh = NULL; + p_fs->buf_cache_array[i].prev = p_fs->buf_cache_array[i].next = NULL; + push_to_mru(&(p_fs->buf_cache_array[i]), &p_fs->buf_cache_lru_list); + } + + /* HASH list */ + for (i = 0; i < FAT_CACHE_HASH_SIZE; i++) { + p_fs->FAT_cache_hash_list[i].drv = -1; + p_fs->FAT_cache_hash_list[i].sec = ~0; + p_fs->FAT_cache_hash_list[i].hash_next = p_fs->FAT_cache_hash_list[i].hash_prev = &(p_fs->FAT_cache_hash_list[i]); + } + + for (i = 0; i < FAT_CACHE_SIZE; i++) + FAT_cache_insert_hash(sb, &(p_fs->FAT_cache_array[i])); + + for (i = 0; i < BUF_CACHE_HASH_SIZE; i++) { + p_fs->buf_cache_hash_list[i].drv = -1; + p_fs->buf_cache_hash_list[i].sec = ~0; + p_fs->buf_cache_hash_list[i].hash_next = p_fs->buf_cache_hash_list[i].hash_prev = &(p_fs->buf_cache_hash_list[i]); + } + + for (i = 0; i < BUF_CACHE_SIZE; i++) + buf_cache_insert_hash(sb, &(p_fs->buf_cache_array[i])); + + return FFS_SUCCESS; +} /* end of buf_init */ + +s32 buf_shutdown(struct super_block *sb) +{ + return FFS_SUCCESS; +} /* end of buf_shutdown */ + +/*======================================================================*/ +/* FAT Read/Write Functions */ +/*======================================================================*/ + +/* in : sb, loc + * out: content + * returns 0 on success + * -1 on error + */ +s32 FAT_read(struct super_block *sb, u32 loc, u32 *content) +{ + s32 ret; + + sm_P(&f_sem); + + ret = __FAT_read(sb, loc, content); + + sm_V(&f_sem); + + return ret; +} /* end of FAT_read */ + +s32 FAT_write(struct super_block *sb, u32 loc, u32 content) +{ + s32 ret; + + sm_P(&f_sem); + + ret = __FAT_write(sb, loc, content); + + sm_V(&f_sem); + + return ret; +} /* end of FAT_write */ + +static s32 __FAT_read(struct super_block *sb, u32 loc, u32 *content) +{ + s32 off; + u32 sec, _content; + u8 *fat_sector, *fat_entry; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + if (p_fs->vol_type == FAT12) { + sec = p_fs->FAT1_start_sector + ((loc + (loc >> 1)) >> p_bd->sector_size_bits); + off = (loc + (loc >> 1)) & p_bd->sector_size_mask; + + if (off == (p_bd->sector_size-1)) { + fat_sector = FAT_getblk(sb, sec); + if (!fat_sector) + return -1; + + _content = (u32) fat_sector[off]; + + fat_sector = FAT_getblk(sb, ++sec); + if (!fat_sector) + return -1; + + _content |= (u32) fat_sector[0] << 8; + } else { + fat_sector = FAT_getblk(sb, sec); + if (!fat_sector) + return -1; + + fat_entry = &(fat_sector[off]); + _content = GET16(fat_entry); + } + + if (loc & 1) + _content >>= 4; + + _content &= 0x00000FFF; + + if (_content >= CLUSTER_16(0x0FF8)) { + *content = CLUSTER_32(~0); + return 0; + } else { + *content = CLUSTER_32(_content); + return 0; + } + } else if (p_fs->vol_type == FAT16) { + sec = p_fs->FAT1_start_sector + (loc >> (p_bd->sector_size_bits-1)); + off = (loc << 1) & p_bd->sector_size_mask; + + fat_sector = FAT_getblk(sb, sec); + if (!fat_sector) + return -1; + + fat_entry = &(fat_sector[off]); + + _content = GET16_A(fat_entry); + + _content &= 0x0000FFFF; + + if (_content >= CLUSTER_16(0xFFF8)) { + *content = CLUSTER_32(~0); + return 0; + } else { + *content = CLUSTER_32(_content); + return 0; + } + } else if (p_fs->vol_type == FAT32) { + sec = p_fs->FAT1_start_sector + (loc >> (p_bd->sector_size_bits-2)); + off = (loc << 2) & p_bd->sector_size_mask; + + fat_sector = FAT_getblk(sb, sec); + if (!fat_sector) + return -1; + + fat_entry = &(fat_sector[off]); + + _content = GET32_A(fat_entry); + + _content &= 0x0FFFFFFF; + + if (_content >= CLUSTER_32(0x0FFFFFF8)) { + *content = CLUSTER_32(~0); + return 0; + } else { + *content = CLUSTER_32(_content); + return 0; + } + } else { + sec = p_fs->FAT1_start_sector + (loc >> (p_bd->sector_size_bits-2)); + off = (loc << 2) & p_bd->sector_size_mask; + + fat_sector = FAT_getblk(sb, sec); + if (!fat_sector) + return -1; + + fat_entry = &(fat_sector[off]); + _content = GET32_A(fat_entry); + + if (_content >= CLUSTER_32(0xFFFFFFF8)) { + *content = CLUSTER_32(~0); + return 0; + } else { + *content = CLUSTER_32(_content); + return 0; + } + } + + *content = CLUSTER_32(~0); + return 0; +} /* end of __FAT_read */ + +static s32 __FAT_write(struct super_block *sb, u32 loc, u32 content) +{ + s32 off; + u32 sec; + u8 *fat_sector, *fat_entry; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + if (p_fs->vol_type == FAT12) { + + content &= 0x00000FFF; + + sec = p_fs->FAT1_start_sector + ((loc + (loc >> 1)) >> p_bd->sector_size_bits); + off = (loc + (loc >> 1)) & p_bd->sector_size_mask; + + fat_sector = FAT_getblk(sb, sec); + if (!fat_sector) + return -1; + + if (loc & 1) { /* odd */ + + content <<= 4; + + if (off == (p_bd->sector_size-1)) { + fat_sector[off] = (u8)(content | (fat_sector[off] & 0x0F)); + FAT_modify(sb, sec); + + fat_sector = FAT_getblk(sb, ++sec); + if (!fat_sector) + return -1; + + fat_sector[0] = (u8)(content >> 8); + } else { + fat_entry = &(fat_sector[off]); + content |= GET16(fat_entry) & 0x000F; + + SET16(fat_entry, content); + } + } else { /* even */ + fat_sector[off] = (u8)(content); + + if (off == (p_bd->sector_size-1)) { + fat_sector[off] = (u8)(content); + FAT_modify(sb, sec); + + fat_sector = FAT_getblk(sb, ++sec); + fat_sector[0] = (u8)((fat_sector[0] & 0xF0) | (content >> 8)); + } else { + fat_entry = &(fat_sector[off]); + content |= GET16(fat_entry) & 0xF000; + + SET16(fat_entry, content); + } + } + } + + else if (p_fs->vol_type == FAT16) { + + content &= 0x0000FFFF; + + sec = p_fs->FAT1_start_sector + (loc >> (p_bd->sector_size_bits-1)); + off = (loc << 1) & p_bd->sector_size_mask; + + fat_sector = FAT_getblk(sb, sec); + if (!fat_sector) + return -1; + + fat_entry = &(fat_sector[off]); + + SET16_A(fat_entry, content); + } + + else if (p_fs->vol_type == FAT32) { + + content &= 0x0FFFFFFF; + + sec = p_fs->FAT1_start_sector + (loc >> (p_bd->sector_size_bits-2)); + off = (loc << 2) & p_bd->sector_size_mask; + + fat_sector = FAT_getblk(sb, sec); + if (!fat_sector) + return -1; + + fat_entry = &(fat_sector[off]); + + content |= GET32_A(fat_entry) & 0xF0000000; + + SET32_A(fat_entry, content); + } + + else { /* p_fs->vol_type == EXFAT */ + + sec = p_fs->FAT1_start_sector + (loc >> (p_bd->sector_size_bits-2)); + off = (loc << 2) & p_bd->sector_size_mask; + + fat_sector = FAT_getblk(sb, sec); + if (!fat_sector) + return -1; + + fat_entry = &(fat_sector[off]); + + SET32_A(fat_entry, content); + } + + FAT_modify(sb, sec); + return 0; +} /* end of __FAT_write */ + +u8 *FAT_getblk(struct super_block *sb, u32 sec) +{ + BUF_CACHE_T *bp; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + bp = FAT_cache_find(sb, sec); + if (bp != NULL) { + move_to_mru(bp, &p_fs->FAT_cache_lru_list); + return bp->buf_bh->b_data; + } + + bp = FAT_cache_get(sb, sec); + + FAT_cache_remove_hash(bp); + + bp->drv = p_fs->drv; + bp->sec = sec; + bp->flag = 0; + + FAT_cache_insert_hash(sb, bp); + + if (sector_read(sb, sec, &(bp->buf_bh), 1) != FFS_SUCCESS) { + FAT_cache_remove_hash(bp); + bp->drv = -1; + bp->sec = ~0; + bp->flag = 0; + bp->buf_bh = NULL; + + move_to_lru(bp, &p_fs->FAT_cache_lru_list); + return NULL; + } + + return bp->buf_bh->b_data; +} /* end of FAT_getblk */ + +void FAT_modify(struct super_block *sb, u32 sec) +{ + BUF_CACHE_T *bp; + + bp = FAT_cache_find(sb, sec); + if (bp != NULL) + sector_write(sb, sec, bp->buf_bh, 0); +} /* end of FAT_modify */ + +void FAT_release_all(struct super_block *sb) +{ + BUF_CACHE_T *bp; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + sm_P(&f_sem); + + bp = p_fs->FAT_cache_lru_list.next; + while (bp != &p_fs->FAT_cache_lru_list) { + if (bp->drv == p_fs->drv) { + bp->drv = -1; + bp->sec = ~0; + bp->flag = 0; + + if (bp->buf_bh) { + __brelse(bp->buf_bh); + bp->buf_bh = NULL; + } + } + bp = bp->next; + } + + sm_V(&f_sem); +} /* end of FAT_release_all */ + +void FAT_sync(struct super_block *sb) +{ + BUF_CACHE_T *bp; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + sm_P(&f_sem); + + bp = p_fs->FAT_cache_lru_list.next; + while (bp != &p_fs->FAT_cache_lru_list) { + if ((bp->drv == p_fs->drv) && (bp->flag & DIRTYBIT)) { + sync_dirty_buffer(bp->buf_bh); + bp->flag &= ~(DIRTYBIT); + } + bp = bp->next; + } + + sm_V(&f_sem); +} /* end of FAT_sync */ + +static BUF_CACHE_T *FAT_cache_find(struct super_block *sb, u32 sec) +{ + s32 off; + BUF_CACHE_T *bp, *hp; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + off = (sec + (sec >> p_fs->sectors_per_clu_bits)) & (FAT_CACHE_HASH_SIZE - 1); + + hp = &(p_fs->FAT_cache_hash_list[off]); + for (bp = hp->hash_next; bp != hp; bp = bp->hash_next) { + if ((bp->drv == p_fs->drv) && (bp->sec == sec)) { + + WARN(!bp->buf_bh, "[EXFAT] FAT_cache has no bh. " + "It will make system panic.\n"); + + touch_buffer(bp->buf_bh); + return bp; + } + } + return NULL; +} /* end of FAT_cache_find */ + +static BUF_CACHE_T *FAT_cache_get(struct super_block *sb, u32 sec) +{ + BUF_CACHE_T *bp; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + bp = p_fs->FAT_cache_lru_list.prev; + + + move_to_mru(bp, &p_fs->FAT_cache_lru_list); + return bp; +} /* end of FAT_cache_get */ + +static void FAT_cache_insert_hash(struct super_block *sb, BUF_CACHE_T *bp) +{ + s32 off; + BUF_CACHE_T *hp; + FS_INFO_T *p_fs; + + p_fs = &(EXFAT_SB(sb)->fs_info); + off = (bp->sec + (bp->sec >> p_fs->sectors_per_clu_bits)) & (FAT_CACHE_HASH_SIZE-1); + + hp = &(p_fs->FAT_cache_hash_list[off]); + bp->hash_next = hp->hash_next; + bp->hash_prev = hp; + hp->hash_next->hash_prev = bp; + hp->hash_next = bp; +} /* end of FAT_cache_insert_hash */ + +static void FAT_cache_remove_hash(BUF_CACHE_T *bp) +{ + (bp->hash_prev)->hash_next = bp->hash_next; + (bp->hash_next)->hash_prev = bp->hash_prev; +} /* end of FAT_cache_remove_hash */ + +/*======================================================================*/ +/* Buffer Read/Write Functions */ +/*======================================================================*/ + +u8 *buf_getblk(struct super_block *sb, u32 sec) +{ + u8 *buf; + + sm_P(&b_sem); + + buf = __buf_getblk(sb, sec); + + sm_V(&b_sem); + + return buf; +} /* end of buf_getblk */ + +static u8 *__buf_getblk(struct super_block *sb, u32 sec) +{ + BUF_CACHE_T *bp; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + bp = buf_cache_find(sb, sec); + if (bp != NULL) { + move_to_mru(bp, &p_fs->buf_cache_lru_list); + return bp->buf_bh->b_data; + } + + bp = buf_cache_get(sb, sec); + + buf_cache_remove_hash(bp); + + bp->drv = p_fs->drv; + bp->sec = sec; + bp->flag = 0; + + buf_cache_insert_hash(sb, bp); + + if (sector_read(sb, sec, &(bp->buf_bh), 1) != FFS_SUCCESS) { + buf_cache_remove_hash(bp); + bp->drv = -1; + bp->sec = ~0; + bp->flag = 0; + bp->buf_bh = NULL; + + move_to_lru(bp, &p_fs->buf_cache_lru_list); + return NULL; + } + + return bp->buf_bh->b_data; + +} /* end of __buf_getblk */ + +void buf_modify(struct super_block *sb, u32 sec) +{ + BUF_CACHE_T *bp; + + sm_P(&b_sem); + + bp = buf_cache_find(sb, sec); + if (likely(bp != NULL)) + sector_write(sb, sec, bp->buf_bh, 0); + + WARN(!bp, "[EXFAT] failed to find buffer_cache(sector:%u).\n", sec); + + sm_V(&b_sem); +} /* end of buf_modify */ + +void buf_lock(struct super_block *sb, u32 sec) +{ + BUF_CACHE_T *bp; + + sm_P(&b_sem); + + bp = buf_cache_find(sb, sec); + if (likely(bp != NULL)) + bp->flag |= LOCKBIT; + + WARN(!bp, "[EXFAT] failed to find buffer_cache(sector:%u).\n", sec); + + sm_V(&b_sem); +} /* end of buf_lock */ + +void buf_unlock(struct super_block *sb, u32 sec) +{ + BUF_CACHE_T *bp; + + sm_P(&b_sem); + + bp = buf_cache_find(sb, sec); + if (likely(bp != NULL)) + bp->flag &= ~(LOCKBIT); + + WARN(!bp, "[EXFAT] failed to find buffer_cache(sector:%u).\n", sec); + + sm_V(&b_sem); +} /* end of buf_unlock */ + +void buf_release(struct super_block *sb, u32 sec) +{ + BUF_CACHE_T *bp; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + sm_P(&b_sem); + + bp = buf_cache_find(sb, sec); + if (likely(bp != NULL)) { + bp->drv = -1; + bp->sec = ~0; + bp->flag = 0; + + if (bp->buf_bh) { + __brelse(bp->buf_bh); + bp->buf_bh = NULL; + } + + move_to_lru(bp, &p_fs->buf_cache_lru_list); + } + + sm_V(&b_sem); +} /* end of buf_release */ + +void buf_release_all(struct super_block *sb) +{ + BUF_CACHE_T *bp; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + sm_P(&b_sem); + + bp = p_fs->buf_cache_lru_list.next; + while (bp != &p_fs->buf_cache_lru_list) { + if (bp->drv == p_fs->drv) { + bp->drv = -1; + bp->sec = ~0; + bp->flag = 0; + + if (bp->buf_bh) { + __brelse(bp->buf_bh); + bp->buf_bh = NULL; + } + } + bp = bp->next; + } + + sm_V(&b_sem); +} /* end of buf_release_all */ + +void buf_sync(struct super_block *sb) +{ + BUF_CACHE_T *bp; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + sm_P(&b_sem); + + bp = p_fs->buf_cache_lru_list.next; + while (bp != &p_fs->buf_cache_lru_list) { + if ((bp->drv == p_fs->drv) && (bp->flag & DIRTYBIT)) { + sync_dirty_buffer(bp->buf_bh); + bp->flag &= ~(DIRTYBIT); + } + bp = bp->next; + } + + sm_V(&b_sem); +} /* end of buf_sync */ + +static BUF_CACHE_T *buf_cache_find(struct super_block *sb, u32 sec) +{ + s32 off; + BUF_CACHE_T *bp, *hp; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + off = (sec + (sec >> p_fs->sectors_per_clu_bits)) & (BUF_CACHE_HASH_SIZE - 1); + + hp = &(p_fs->buf_cache_hash_list[off]); + for (bp = hp->hash_next; bp != hp; bp = bp->hash_next) { + if ((bp->drv == p_fs->drv) && (bp->sec == sec)) { + touch_buffer(bp->buf_bh); + return bp; + } + } + return NULL; +} /* end of buf_cache_find */ + +static BUF_CACHE_T *buf_cache_get(struct super_block *sb, u32 sec) +{ + BUF_CACHE_T *bp; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + bp = p_fs->buf_cache_lru_list.prev; + while (bp->flag & LOCKBIT) + bp = bp->prev; + + + move_to_mru(bp, &p_fs->buf_cache_lru_list); + return bp; +} /* end of buf_cache_get */ + +static void buf_cache_insert_hash(struct super_block *sb, BUF_CACHE_T *bp) +{ + s32 off; + BUF_CACHE_T *hp; + FS_INFO_T *p_fs; + + p_fs = &(EXFAT_SB(sb)->fs_info); + off = (bp->sec + (bp->sec >> p_fs->sectors_per_clu_bits)) & (BUF_CACHE_HASH_SIZE-1); + + hp = &(p_fs->buf_cache_hash_list[off]); + bp->hash_next = hp->hash_next; + bp->hash_prev = hp; + hp->hash_next->hash_prev = bp; + hp->hash_next = bp; +} /* end of buf_cache_insert_hash */ + +static void buf_cache_remove_hash(BUF_CACHE_T *bp) +{ + (bp->hash_prev)->hash_next = bp->hash_next; + (bp->hash_next)->hash_prev = bp->hash_prev; +} /* end of buf_cache_remove_hash */ + +/*======================================================================*/ +/* Local Function Definitions */ +/*======================================================================*/ + +static void push_to_mru(BUF_CACHE_T *bp, BUF_CACHE_T *list) +{ + bp->next = list->next; + bp->prev = list; + list->next->prev = bp; + list->next = bp; +} /* end of buf_cache_push_to_mru */ + +static void push_to_lru(BUF_CACHE_T *bp, BUF_CACHE_T *list) +{ + bp->prev = list->prev; + bp->next = list; + list->prev->next = bp; + list->prev = bp; +} /* end of buf_cache_push_to_lru */ + +static void move_to_mru(BUF_CACHE_T *bp, BUF_CACHE_T *list) +{ + bp->prev->next = bp->next; + bp->next->prev = bp->prev; + push_to_mru(bp, list); +} /* end of buf_cache_move_to_mru */ + +static void move_to_lru(BUF_CACHE_T *bp, BUF_CACHE_T *list) +{ + bp->prev->next = bp->next; + bp->next->prev = bp->prev; + push_to_lru(bp, list); +} /* end of buf_cache_move_to_lru */ diff --git b/fs/exfat/exfat_cache.h b/fs/exfat/exfat_cache.h new file mode 100644 index 0000000..d82aad5 --- /dev/null +++ b/fs/exfat/exfat_cache.h @@ -0,0 +1,85 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_cache.h */ +/* PURPOSE : Header File for exFAT Cache Manager */ +/* (FAT Cache & Buffer Cache) */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Sung-Kwan Kim] : first writing */ +/* */ +/************************************************************************/ + +#ifndef _EXFAT_CACHE_H +#define _EXFAT_CACHE_H + +#include +#include +#include "exfat_config.h" + +/*----------------------------------------------------------------------*/ +/* Constant & Macro Definitions */ +/*----------------------------------------------------------------------*/ + +#define LOCKBIT 0x01 +#define DIRTYBIT 0x02 + +/*----------------------------------------------------------------------*/ +/* Type Definitions */ +/*----------------------------------------------------------------------*/ + +typedef struct __BUF_CACHE_T { + struct __BUF_CACHE_T *next; + struct __BUF_CACHE_T *prev; + struct __BUF_CACHE_T *hash_next; + struct __BUF_CACHE_T *hash_prev; + s32 drv; + u32 sec; + u32 flag; + struct buffer_head *buf_bh; +} BUF_CACHE_T; + +/*----------------------------------------------------------------------*/ +/* External Function Declarations */ +/*----------------------------------------------------------------------*/ + +s32 buf_init(struct super_block *sb); +s32 buf_shutdown(struct super_block *sb); +s32 FAT_read(struct super_block *sb, u32 loc, u32 *content); +s32 FAT_write(struct super_block *sb, u32 loc, u32 content); +u8 *FAT_getblk(struct super_block *sb, u32 sec); +void FAT_modify(struct super_block *sb, u32 sec); +void FAT_release_all(struct super_block *sb); +void FAT_sync(struct super_block *sb); +u8 *buf_getblk(struct super_block *sb, u32 sec); +void buf_modify(struct super_block *sb, u32 sec); +void buf_lock(struct super_block *sb, u32 sec); +void buf_unlock(struct super_block *sb, u32 sec); +void buf_release(struct super_block *sb, u32 sec); +void buf_release_all(struct super_block *sb); +void buf_sync(struct super_block *sb); + +#endif /* _EXFAT_CACHE_H */ diff --git b/fs/exfat/exfat_config.h b/fs/exfat/exfat_config.h new file mode 100644 index 0000000..33c6525 --- /dev/null +++ b/fs/exfat/exfat_config.h @@ -0,0 +1,69 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_config.h */ +/* PURPOSE : Header File for exFAT Configuable Policies */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#ifndef _EXFAT_CONFIG_H +#define _EXFAT_CONFIG_H + +/*======================================================================*/ +/* */ +/* FFS CONFIGURATIONS */ +/* (CHANGE THIS PART IF REQUIRED) */ +/* */ +/*======================================================================*/ + +/*----------------------------------------------------------------------*/ +/* Feature Config */ +/*----------------------------------------------------------------------*/ +#ifndef CONFIG_EXFAT_DISCARD +#define CONFIG_EXFAT_DISCARD 1 /* mount option -o discard support */ +#endif + +#ifndef CONFIG_EXFAT_DELAYED_SYNC +#define CONFIG_EXFAT_DELAYED_SYNC 0 +#endif + +#ifndef CONFIG_EXFAT_KERNEL_DEBUG +#define CONFIG_EXFAT_KERNEL_DEBUG 1 /* kernel debug features via ioctl */ +#endif + +#ifndef CONFIG_EXFAT_DEBUG_MSG +#define CONFIG_EXFAT_DEBUG_MSG 0 /* debugging message on/off */ +#endif + +#ifndef CONFIG_EXFAT_DEFAULT_CODEPAGE +#define CONFIG_EXFAT_DEFAULT_CODEPAGE 437 +#define CONFIG_EXFAT_DEFAULT_IOCHARSET "utf8" +#endif + +#endif /* _EXFAT_CONFIG_H */ diff --git b/fs/exfat/exfat_core.c b/fs/exfat/exfat_core.c new file mode 100644 index 0000000..d485141 --- /dev/null +++ b/fs/exfat/exfat_core.c @@ -0,0 +1,5114 @@ +/* Some of the source code in this file came from "linux/fs/fat/misc.c". */ +/* + * linux/fs/fat/misc.c + * + * Written 1992,1993 by Werner Almesberger + * 22/11/2000 - Fixed fat_date_unix2dos for dates earlier than 01/01/1980 + * and date_dos2unix for date==0 by Igor Zhbanov(bsg@uniyar.ac.ru) + */ + +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_core.c */ +/* PURPOSE : exFAT File Manager */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#include +#include +#include + +#include "exfat_bitmap.h" +#include "exfat_config.h" +#include "exfat_data.h" +#include "exfat_oal.h" +#include "exfat_blkdev.h" +#include "exfat_cache.h" +#include "exfat_nls.h" +#include "exfat_api.h" +#include "exfat_super.h" +#include "exfat_core.h" + +#include + +static void __set_sb_dirty(struct super_block *sb) +{ +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,7,0) + sb->s_dirt = 1; +#else + struct exfat_sb_info *sbi = EXFAT_SB(sb); + sbi->s_dirt = 1; +#endif +} + +/*----------------------------------------------------------------------*/ +/* Global Variable Definitions */ +/*----------------------------------------------------------------------*/ + +extern u8 uni_upcase[]; + +/*----------------------------------------------------------------------*/ +/* Local Variable Definitions */ +/*----------------------------------------------------------------------*/ + +static u8 name_buf[MAX_PATH_LENGTH * MAX_CHARSET_SIZE]; + +static char *reserved_names[] = { + "AUX ", "CON ", "NUL ", "PRN ", + "COM1 ", "COM2 ", "COM3 ", "COM4 ", + "COM5 ", "COM6 ", "COM7 ", "COM8 ", "COM9 ", + "LPT1 ", "LPT2 ", "LPT3 ", "LPT4 ", + "LPT5 ", "LPT6 ", "LPT7 ", "LPT8 ", "LPT9 ", + NULL +}; + +static u8 free_bit[] = { + 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, /* 0 ~ 19 */ + 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5, 0, 1, 0, 2, 0, 1, 0, 3, /* 20 ~ 39 */ + 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, /* 40 ~ 59 */ + 0, 1, 0, 6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, /* 60 ~ 79 */ + 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5, 0, 1, 0, 2, /* 80 ~ 99 */ + 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 3, /* 100 ~ 119 */ + 0, 1, 0, 2, 0, 1, 0, 7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, /* 120 ~ 139 */ + 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5, /* 140 ~ 159 */ + 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, /* 160 ~ 179 */ + 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 6, 0, 1, 0, 2, 0, 1, 0, 3, /* 180 ~ 199 */ + 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, /* 200 ~ 219 */ + 0, 1, 0, 5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, /* 220 ~ 239 */ + 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0 /* 240 ~ 254 */ +}; + +static u8 used_bit[] = { + 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, /* 0 ~ 19 */ + 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, /* 20 ~ 39 */ + 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, /* 40 ~ 59 */ + 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, /* 60 ~ 79 */ + 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, /* 80 ~ 99 */ + 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, /* 100 ~ 119 */ + 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, /* 120 ~ 139 */ + 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, /* 140 ~ 159 */ + 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, /* 160 ~ 179 */ + 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, /* 180 ~ 199 */ + 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, /* 200 ~ 219 */ + 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, /* 220 ~ 239 */ + 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 /* 240 ~ 255 */ +}; + +/*======================================================================*/ +/* Global Function Definitions */ +/*======================================================================*/ + +/* ffsInit : roll back to the initial state of the file system */ +s32 ffsInit(void) +{ + s32 ret; + + ret = bdev_init(); + if (ret) + return ret; + + ret = fs_init(); + if (ret) + return ret; + + return FFS_SUCCESS; +} /* end of ffsInit */ + +/* ffsShutdown : make free all memory-alloced global buffers */ +s32 ffsShutdown(void) +{ + s32 ret; + ret = fs_shutdown(); + if (ret) + return ret; + + ret = bdev_shutdown(); + if (ret) + return ret; + + return FFS_SUCCESS; +} /* end of ffsShutdown */ + +/* ffsMountVol : mount the file system volume */ +s32 ffsMountVol(struct super_block *sb) +{ + int i, ret; + PBR_SECTOR_T *p_pbr; + struct buffer_head *tmp_bh = NULL; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + printk("[EXFAT] trying to mount...\n"); + + sm_init(&p_fs->v_sem); + p_fs->dev_ejected = FALSE; + + /* open the block device */ + if (bdev_open(sb)) + return FFS_MEDIAERR; + + if (p_bd->sector_size < sb->s_blocksize) + return FFS_MEDIAERR; + if (p_bd->sector_size > sb->s_blocksize) + sb_set_blocksize(sb, p_bd->sector_size); + + /* read Sector 0 */ + if (sector_read(sb, 0, &tmp_bh, 1) != FFS_SUCCESS) + return FFS_MEDIAERR; + + p_fs->PBR_sector = 0; + + p_pbr = (PBR_SECTOR_T *) tmp_bh->b_data; + + /* check the validity of PBR */ + if (GET16_A(p_pbr->signature) != PBR_SIGNATURE) { + brelse(tmp_bh); + bdev_close(sb); + return FFS_FORMATERR; + } + + /* fill fs_stuct */ + for (i = 0; i < 53; i++) + if (p_pbr->bpb[i]) + break; + + if (i < 53) { + if (GET16(p_pbr->bpb+11)) /* num_fat_sectors */ + ret = fat16_mount(sb, p_pbr); + else + ret = fat32_mount(sb, p_pbr); + } else { + ret = exfat_mount(sb, p_pbr); + } + + brelse(tmp_bh); + + if (ret) { + bdev_close(sb); + return ret; + } + + if (p_fs->vol_type == EXFAT) { + ret = load_alloc_bitmap(sb); + if (ret) { + bdev_close(sb); + return ret; + } + ret = load_upcase_table(sb); + if (ret) { + free_alloc_bitmap(sb); + bdev_close(sb); + return ret; + } + } + + if (p_fs->dev_ejected) { + if (p_fs->vol_type == EXFAT) { + free_upcase_table(sb); + free_alloc_bitmap(sb); + } + bdev_close(sb); + return FFS_MEDIAERR; + } + + printk("[EXFAT] mounted successfully\n"); + + return FFS_SUCCESS; +} /* end of ffsMountVol */ + +/* ffsUmountVol : umount the file system volume */ +s32 ffsUmountVol(struct super_block *sb) +{ + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + printk("[EXFAT] trying to unmount...\n"); + + fs_sync(sb, 0); + fs_set_vol_flags(sb, VOL_CLEAN); + + if (p_fs->vol_type == EXFAT) { + free_upcase_table(sb); + free_alloc_bitmap(sb); + } + + FAT_release_all(sb); + buf_release_all(sb); + + /* close the block device */ + bdev_close(sb); + + if (p_fs->dev_ejected) { + printk("[EXFAT] unmounted with media errors. " + "device's already ejected.\n"); + return FFS_MEDIAERR; + } + + printk("[EXFAT] unmounted successfully\n"); + + return FFS_SUCCESS; +} /* end of ffsUmountVol */ + +/* ffsGetVolInfo : get the information of a file system volume */ +s32 ffsGetVolInfo(struct super_block *sb, VOL_INFO_T *info) +{ + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + if (p_fs->used_clusters == (u32) ~0) + p_fs->used_clusters = p_fs->fs_func->count_used_clusters(sb); + + info->FatType = p_fs->vol_type; + info->ClusterSize = p_fs->cluster_size; + info->NumClusters = p_fs->num_clusters - 2; /* clu 0 & 1 */ + info->UsedClusters = p_fs->used_clusters; + info->FreeClusters = info->NumClusters - info->UsedClusters; + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return FFS_SUCCESS; +} /* end of ffsGetVolInfo */ + +/* ffsSyncVol : synchronize all file system volumes */ +s32 ffsSyncVol(struct super_block *sb, s32 do_sync) +{ + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* synchronize the file system */ + fs_sync(sb, do_sync); + fs_set_vol_flags(sb, VOL_CLEAN); + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return FFS_SUCCESS; +} /* end of ffsSyncVol */ + +/*----------------------------------------------------------------------*/ +/* File Operation Functions */ +/*----------------------------------------------------------------------*/ + +/* ffsLookupFile : lookup a file */ +s32 ffsLookupFile(struct inode *inode, char *path, FILE_ID_T *fid) +{ + s32 ret, dentry, num_entries; + CHAIN_T dir; + UNI_NAME_T uni_name; + DOS_NAME_T dos_name; + DENTRY_T *ep, *ep2; + ENTRY_SET_CACHE_T *es = NULL; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + DPRINTK("ffsLookupFile entered\n"); + + /* check the validity of directory name in the given pathname */ + ret = resolve_path(inode, path, &dir, &uni_name); + if (ret) + return ret; + + ret = get_num_entries_and_dos_name(sb, &dir, &uni_name, &num_entries, &dos_name); + if (ret) + return ret; + + /* search the file name for directories */ + dentry = p_fs->fs_func->find_dir_entry(sb, &dir, &uni_name, num_entries, &dos_name, TYPE_ALL); + if (dentry < -1) + return FFS_NOTFOUND; + + fid->dir.dir = dir.dir; + fid->dir.size = dir.size; + fid->dir.flags = dir.flags; + fid->entry = dentry; + + if (dentry == -1) { + fid->type = TYPE_DIR; + fid->rwoffset = 0; + fid->hint_last_off = -1; + + fid->attr = ATTR_SUBDIR; + fid->flags = 0x01; + fid->size = 0; + fid->start_clu = p_fs->root_dir; + } else { + if (p_fs->vol_type == EXFAT) { + es = get_entry_set_in_dir(sb, &dir, dentry, ES_2_ENTRIES, &ep); + if (!es) + return FFS_MEDIAERR; + ep2 = ep+1; + } else { + ep = get_entry_in_dir(sb, &dir, dentry, NULL); + if (!ep) + return FFS_MEDIAERR; + ep2 = ep; + } + + fid->type = p_fs->fs_func->get_entry_type(ep); + fid->rwoffset = 0; + fid->hint_last_off = -1; + fid->attr = p_fs->fs_func->get_entry_attr(ep); + + fid->size = p_fs->fs_func->get_entry_size(ep2); + if ((fid->type == TYPE_FILE) && (fid->size == 0)) { + fid->flags = (p_fs->vol_type == EXFAT) ? 0x03 : 0x01; + fid->start_clu = CLUSTER_32(~0); + } else { + fid->flags = p_fs->fs_func->get_entry_flag(ep2); + fid->start_clu = p_fs->fs_func->get_entry_clu0(ep2); + } + + if (p_fs->vol_type == EXFAT) + release_entry_set(es); + } + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + DPRINTK("ffsLookupFile exited successfully\n"); + + return FFS_SUCCESS; +} /* end of ffsLookupFile */ + +/* ffsCreateFile : create a file */ +s32 ffsCreateFile(struct inode *inode, char *path, u8 mode, FILE_ID_T *fid) +{ + s32 ret/*, dentry*/; + CHAIN_T dir; + UNI_NAME_T uni_name; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + /* check the validity of directory name in the given pathname */ + ret = resolve_path(inode, path, &dir, &uni_name); + if (ret) + return ret; + + fs_set_vol_flags(sb, VOL_DIRTY); + + /* create a new file */ + ret = create_file(inode, &dir, &uni_name, mode, fid); + +#ifdef CONFIG_EXFAT_DELAYED_SYNC + fs_sync(sb, 0); + fs_set_vol_flags(sb, VOL_CLEAN); +#endif + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return ret; +} /* end of ffsCreateFile */ + +/* ffsReadFile : read data from a opened file */ +s32 ffsReadFile(struct inode *inode, FILE_ID_T *fid, void *buffer, u64 count, u64 *rcount) +{ + s32 offset, sec_offset, clu_offset; + u32 clu, LogSector; + u64 oneblkread, read_bytes; + struct buffer_head *tmp_bh = NULL; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + /* check if the given file ID is opened */ + if (fid->type != TYPE_FILE) + return FFS_PERMISSIONERR; + + if (fid->rwoffset > fid->size) + fid->rwoffset = fid->size; + + if (count > (fid->size - fid->rwoffset)) + count = fid->size - fid->rwoffset; + + if (count == 0) { + if (rcount != NULL) + *rcount = 0; + return FFS_EOF; + } + + read_bytes = 0; + + while (count > 0) { + clu_offset = (s32)(fid->rwoffset >> p_fs->cluster_size_bits); + clu = fid->start_clu; + + if (fid->flags == 0x03) { + clu += clu_offset; + } else { + /* hint information */ + if ((clu_offset > 0) && (fid->hint_last_off > 0) && + (clu_offset >= fid->hint_last_off)) { + clu_offset -= fid->hint_last_off; + clu = fid->hint_last_clu; + } + + while (clu_offset > 0) { + /* clu = FAT_read(sb, clu); */ + if (FAT_read(sb, clu, &clu) == -1) + return FFS_MEDIAERR; + + clu_offset--; + } + } + + /* hint information */ + fid->hint_last_off = (s32)(fid->rwoffset >> p_fs->cluster_size_bits); + fid->hint_last_clu = clu; + + offset = (s32)(fid->rwoffset & (p_fs->cluster_size-1)); /* byte offset in cluster */ + sec_offset = offset >> p_bd->sector_size_bits; /* sector offset in cluster */ + offset &= p_bd->sector_size_mask; /* byte offset in sector */ + + LogSector = START_SECTOR(clu) + sec_offset; + + oneblkread = (u64)(p_bd->sector_size - offset); + if (oneblkread > count) + oneblkread = count; + + if ((offset == 0) && (oneblkread == p_bd->sector_size)) { + if (sector_read(sb, LogSector, &tmp_bh, 1) != FFS_SUCCESS) + goto err_out; + memcpy(((char *) buffer)+read_bytes, ((char *) tmp_bh->b_data), (s32) oneblkread); + } else { + if (sector_read(sb, LogSector, &tmp_bh, 1) != FFS_SUCCESS) + goto err_out; + memcpy(((char *) buffer)+read_bytes, ((char *) tmp_bh->b_data)+offset, (s32) oneblkread); + } + count -= oneblkread; + read_bytes += oneblkread; + fid->rwoffset += oneblkread; + } + brelse(tmp_bh); + +err_out: + /* set the size of read bytes */ + if (rcount != NULL) + *rcount = read_bytes; + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return FFS_SUCCESS; +} /* end of ffsReadFile */ + +/* ffsWriteFile : write data into a opened file */ +s32 ffsWriteFile(struct inode *inode, FILE_ID_T *fid, void *buffer, u64 count, u64 *wcount) +{ + s32 modified = FALSE, offset, sec_offset, clu_offset; + s32 num_clusters, num_alloc, num_alloced = (s32) ~0; + u32 clu, last_clu, LogSector, sector = 0; + u64 oneblkwrite, write_bytes; + CHAIN_T new_clu; + TIMESTAMP_T tm; + DENTRY_T *ep, *ep2; + ENTRY_SET_CACHE_T *es = NULL; + struct buffer_head *tmp_bh = NULL; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + /* check if the given file ID is opened */ + if (fid->type != TYPE_FILE) + return FFS_PERMISSIONERR; + + if (fid->rwoffset > fid->size) + fid->rwoffset = fid->size; + + if (count == 0) { + if (wcount != NULL) + *wcount = 0; + return FFS_SUCCESS; + } + + fs_set_vol_flags(sb, VOL_DIRTY); + + if (fid->size == 0) + num_clusters = 0; + else + num_clusters = (s32)((fid->size-1) >> p_fs->cluster_size_bits) + 1; + + write_bytes = 0; + + while (count > 0) { + clu_offset = (s32)(fid->rwoffset >> p_fs->cluster_size_bits); + clu = last_clu = fid->start_clu; + + if (fid->flags == 0x03) { + if ((clu_offset > 0) && (clu != CLUSTER_32(~0))) { + last_clu += clu_offset - 1; + + if (clu_offset == num_clusters) + clu = CLUSTER_32(~0); + else + clu += clu_offset; + } + } else { + /* hint information */ + if ((clu_offset > 0) && (fid->hint_last_off > 0) && + (clu_offset >= fid->hint_last_off)) { + clu_offset -= fid->hint_last_off; + clu = fid->hint_last_clu; + } + + while ((clu_offset > 0) && (clu != CLUSTER_32(~0))) { + last_clu = clu; + /* clu = FAT_read(sb, clu); */ + if (FAT_read(sb, clu, &clu) == -1) + return FFS_MEDIAERR; + + clu_offset--; + } + } + + if (clu == CLUSTER_32(~0)) { + num_alloc = (s32)((count-1) >> p_fs->cluster_size_bits) + 1; + new_clu.dir = (last_clu == CLUSTER_32(~0)) ? CLUSTER_32(~0) : last_clu+1; + new_clu.size = 0; + new_clu.flags = fid->flags; + + /* (1) allocate a chain of clusters */ + num_alloced = p_fs->fs_func->alloc_cluster(sb, num_alloc, &new_clu); + if (num_alloced == 0) + break; + else if (num_alloced < 0) + return FFS_MEDIAERR; + + /* (2) append to the FAT chain */ + if (last_clu == CLUSTER_32(~0)) { + if (new_clu.flags == 0x01) + fid->flags = 0x01; + fid->start_clu = new_clu.dir; + modified = TRUE; + } else { + if (new_clu.flags != fid->flags) { + exfat_chain_cont_cluster(sb, fid->start_clu, num_clusters); + fid->flags = 0x01; + modified = TRUE; + } + if (new_clu.flags == 0x01) + FAT_write(sb, last_clu, new_clu.dir); + } + + num_clusters += num_alloced; + clu = new_clu.dir; + } + + /* hint information */ + fid->hint_last_off = (s32)(fid->rwoffset >> p_fs->cluster_size_bits); + fid->hint_last_clu = clu; + + offset = (s32)(fid->rwoffset & (p_fs->cluster_size-1)); /* byte offset in cluster */ + sec_offset = offset >> p_bd->sector_size_bits; /* sector offset in cluster */ + offset &= p_bd->sector_size_mask; /* byte offset in sector */ + + LogSector = START_SECTOR(clu) + sec_offset; + + oneblkwrite = (u64)(p_bd->sector_size - offset); + if (oneblkwrite > count) + oneblkwrite = count; + + if ((offset == 0) && (oneblkwrite == p_bd->sector_size)) { + if (sector_read(sb, LogSector, &tmp_bh, 0) != FFS_SUCCESS) + goto err_out; + memcpy(((char *) tmp_bh->b_data), ((char *) buffer)+write_bytes, (s32) oneblkwrite); + if (sector_write(sb, LogSector, tmp_bh, 0) != FFS_SUCCESS) { + brelse(tmp_bh); + goto err_out; + } + } else { + if ((offset > 0) || ((fid->rwoffset+oneblkwrite) < fid->size)) { + if (sector_read(sb, LogSector, &tmp_bh, 1) != FFS_SUCCESS) + goto err_out; + } else { + if (sector_read(sb, LogSector, &tmp_bh, 0) != FFS_SUCCESS) + goto err_out; + } + + memcpy(((char *) tmp_bh->b_data)+offset, ((char *) buffer)+write_bytes, (s32) oneblkwrite); + if (sector_write(sb, LogSector, tmp_bh, 0) != FFS_SUCCESS) { + brelse(tmp_bh); + goto err_out; + } + } + + count -= oneblkwrite; + write_bytes += oneblkwrite; + fid->rwoffset += oneblkwrite; + + fid->attr |= ATTR_ARCHIVE; + + if (fid->size < fid->rwoffset) { + fid->size = fid->rwoffset; + modified = TRUE; + } + } + + brelse(tmp_bh); + + /* (3) update the direcoty entry */ + if (p_fs->vol_type == EXFAT) { + es = get_entry_set_in_dir(sb, &(fid->dir), fid->entry, ES_ALL_ENTRIES, &ep); + if (es == NULL) + goto err_out; + ep2 = ep+1; + } else { + ep = get_entry_in_dir(sb, &(fid->dir), fid->entry, §or); + if (!ep) + goto err_out; + ep2 = ep; + } + + p_fs->fs_func->set_entry_time(ep, tm_current(&tm), TM_MODIFY); + p_fs->fs_func->set_entry_attr(ep, fid->attr); + + if (p_fs->vol_type != EXFAT) + buf_modify(sb, sector); + + if (modified) { + if (p_fs->fs_func->get_entry_flag(ep2) != fid->flags) + p_fs->fs_func->set_entry_flag(ep2, fid->flags); + + if (p_fs->fs_func->get_entry_size(ep2) != fid->size) + p_fs->fs_func->set_entry_size(ep2, fid->size); + + if (p_fs->fs_func->get_entry_clu0(ep2) != fid->start_clu) + p_fs->fs_func->set_entry_clu0(ep2, fid->start_clu); + + if (p_fs->vol_type != EXFAT) + buf_modify(sb, sector); + } + + if (p_fs->vol_type == EXFAT) { + update_dir_checksum_with_entry_set(sb, es); + release_entry_set(es); + } + +#ifdef CONFIG_EXFAT_DELAYED_SYNC + fs_sync(sb, 0); + fs_set_vol_flags(sb, VOL_CLEAN); +#endif + +err_out: + /* set the size of written bytes */ + if (wcount != NULL) + *wcount = write_bytes; + + if (num_alloced == 0) + return FFS_FULL; + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return FFS_SUCCESS; +} /* end of ffsWriteFile */ + +/* ffsTruncateFile : resize the file length */ +s32 ffsTruncateFile(struct inode *inode, u64 old_size, u64 new_size) +{ + s32 num_clusters; + u32 last_clu = CLUSTER_32(0), sector = 0; + CHAIN_T clu; + TIMESTAMP_T tm; + DENTRY_T *ep, *ep2; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + FILE_ID_T *fid = &(EXFAT_I(inode)->fid); + ENTRY_SET_CACHE_T *es = NULL; + + /* check if the given file ID is opened */ + if (fid->type != TYPE_FILE) + return FFS_PERMISSIONERR; + + if (fid->size != old_size) { + printk(KERN_ERR "[EXFAT] truncate : can't skip it because of " + "size-mismatch(old:%lld->fid:%lld).\n" + ,old_size, fid->size); + } + + if (old_size <= new_size) + return FFS_SUCCESS; + + fs_set_vol_flags(sb, VOL_DIRTY); + + clu.dir = fid->start_clu; + clu.size = (s32)((old_size-1) >> p_fs->cluster_size_bits) + 1; + clu.flags = fid->flags; + + if (new_size > 0) { + num_clusters = (s32)((new_size-1) >> p_fs->cluster_size_bits) + 1; + + if (clu.flags == 0x03) { + clu.dir += num_clusters; + } else { + while (num_clusters > 0) { + last_clu = clu.dir; + if (FAT_read(sb, clu.dir, &(clu.dir)) == -1) + return FFS_MEDIAERR; + num_clusters--; + } + } + + clu.size -= num_clusters; + } + + fid->size = new_size; + fid->attr |= ATTR_ARCHIVE; + if (new_size == 0) { + fid->flags = (p_fs->vol_type == EXFAT) ? 0x03 : 0x01; + fid->start_clu = CLUSTER_32(~0); + } + + /* (1) update the directory entry */ + if (p_fs->vol_type == EXFAT) { + es = get_entry_set_in_dir(sb, &(fid->dir), fid->entry, ES_ALL_ENTRIES, &ep); + if (es == NULL) + return FFS_MEDIAERR; + ep2 = ep+1; + } else { + ep = get_entry_in_dir(sb, &(fid->dir), fid->entry, §or); + if (!ep) + return FFS_MEDIAERR; + ep2 = ep; + } + + p_fs->fs_func->set_entry_time(ep, tm_current(&tm), TM_MODIFY); + p_fs->fs_func->set_entry_attr(ep, fid->attr); + + p_fs->fs_func->set_entry_size(ep2, new_size); + if (new_size == 0) { + p_fs->fs_func->set_entry_flag(ep2, 0x01); + p_fs->fs_func->set_entry_clu0(ep2, CLUSTER_32(0)); + } + + if (p_fs->vol_type != EXFAT) + buf_modify(sb, sector); + else { + update_dir_checksum_with_entry_set(sb, es); + release_entry_set(es); + } + + /* (2) cut off from the FAT chain */ + if (last_clu != CLUSTER_32(0)) { + if (fid->flags == 0x01) + FAT_write(sb, last_clu, CLUSTER_32(~0)); + } + + /* (3) free the clusters */ + p_fs->fs_func->free_cluster(sb, &clu, 0); + + /* hint information */ + fid->hint_last_off = -1; + if (fid->rwoffset > fid->size) + fid->rwoffset = fid->size; + +#ifdef CONFIG_EXFAT_DELAYED_SYNC + fs_sync(sb, 0); + fs_set_vol_flags(sb, VOL_CLEAN); +#endif + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return FFS_SUCCESS; +} /* end of ffsTruncateFile */ + +static void update_parent_info(FILE_ID_T *fid, struct inode *parent_inode) +{ + FS_INFO_T *p_fs = &(EXFAT_SB(parent_inode->i_sb)->fs_info); + FILE_ID_T *parent_fid = &(EXFAT_I(parent_inode)->fid); + + if (unlikely((parent_fid->flags != fid->dir.flags) + || (parent_fid->size != (fid->dir.size<cluster_size_bits)) + || (parent_fid->start_clu != fid->dir.dir))) { + + fid->dir.dir = parent_fid->start_clu; + fid->dir.flags = parent_fid->flags; + fid->dir.size = ((parent_fid->size + (p_fs->cluster_size-1)) + >> p_fs->cluster_size_bits); + } +} + +/* ffsMoveFile : move(rename) a old file into a new file */ +s32 ffsMoveFile(struct inode *old_parent_inode, FILE_ID_T *fid, struct inode *new_parent_inode, struct dentry *new_dentry) +{ + s32 ret; + s32 dentry; + CHAIN_T olddir, newdir; + CHAIN_T *p_dir = NULL; + UNI_NAME_T uni_name; + DENTRY_T *ep; + struct super_block *sb = old_parent_inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + u8 *new_path = (u8 *) new_dentry->d_name.name; + struct inode *new_inode = new_dentry->d_inode; + int num_entries; + FILE_ID_T *new_fid = NULL; + s32 new_entry = 0; + + /* check the validity of pointer parameters */ + if ((new_path == NULL) || (*new_path == '\0')) + return FFS_ERROR; + + update_parent_info(fid, old_parent_inode); + + olddir.dir = fid->dir.dir; + olddir.size = fid->dir.size; + olddir.flags = fid->dir.flags; + + dentry = fid->entry; + + /* check if the old file is "." or ".." */ + if (p_fs->vol_type != EXFAT) { + if ((olddir.dir != p_fs->root_dir) && (dentry < 2)) + return FFS_PERMISSIONERR; + } + + ep = get_entry_in_dir(sb, &olddir, dentry, NULL); + if (!ep) + return FFS_MEDIAERR; + + if (p_fs->fs_func->get_entry_attr(ep) & ATTR_READONLY) + return FFS_PERMISSIONERR; + + /* check whether new dir is existing directory and empty */ + if (new_inode) { + u32 entry_type; + + ret = FFS_MEDIAERR; + new_fid = &EXFAT_I(new_inode)->fid; + + update_parent_info(new_fid, new_parent_inode); + + p_dir = &(new_fid->dir); + new_entry = new_fid->entry; + ep = get_entry_in_dir(sb, p_dir, new_entry, NULL); + if (!ep) + goto out; + + entry_type = p_fs->fs_func->get_entry_type(ep); + + if (entry_type == TYPE_DIR) { + CHAIN_T new_clu; + new_clu.dir = new_fid->start_clu; + new_clu.size = (s32)((new_fid->size-1) >> p_fs->cluster_size_bits) + 1; + new_clu.flags = new_fid->flags; + + if (!is_dir_empty(sb, &new_clu)) + return FFS_FILEEXIST; + } + } + + /* check the validity of directory name in the given new pathname */ + ret = resolve_path(new_parent_inode, new_path, &newdir, &uni_name); + if (ret) + return ret; + + fs_set_vol_flags(sb, VOL_DIRTY); + + if (olddir.dir == newdir.dir) + ret = rename_file(new_parent_inode, &olddir, dentry, &uni_name, fid); + else + ret = move_file(new_parent_inode, &olddir, dentry, &newdir, &uni_name, fid); + + if ((ret == FFS_SUCCESS) && new_inode) { + /* delete entries of new_dir */ + ep = get_entry_in_dir(sb, p_dir, new_entry, NULL); + if (!ep) + goto out; + + num_entries = p_fs->fs_func->count_ext_entries(sb, p_dir, new_entry, ep); + if (num_entries < 0) + goto out; + p_fs->fs_func->delete_dir_entry(sb, p_dir, new_entry, 0, num_entries+1); + } +out: +#ifdef CONFIG_EXFAT_DELAYED_SYNC + fs_sync(sb, 0); + fs_set_vol_flags(sb, VOL_CLEAN); +#endif + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return ret; +} /* end of ffsMoveFile */ + +/* ffsRemoveFile : remove a file */ +s32 ffsRemoveFile(struct inode *inode, FILE_ID_T *fid) +{ + s32 dentry; + CHAIN_T dir, clu_to_free; + DENTRY_T *ep; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + dir.dir = fid->dir.dir; + dir.size = fid->dir.size; + dir.flags = fid->dir.flags; + + dentry = fid->entry; + + ep = get_entry_in_dir(sb, &dir, dentry, NULL); + if (!ep) + return FFS_MEDIAERR; + + if (p_fs->fs_func->get_entry_attr(ep) & ATTR_READONLY) + return FFS_PERMISSIONERR; + + fs_set_vol_flags(sb, VOL_DIRTY); + + /* (1) update the directory entry */ + remove_file(inode, &dir, dentry); + + clu_to_free.dir = fid->start_clu; + clu_to_free.size = (s32)((fid->size-1) >> p_fs->cluster_size_bits) + 1; + clu_to_free.flags = fid->flags; + + /* (2) free the clusters */ + p_fs->fs_func->free_cluster(sb, &clu_to_free, 0); + + fid->size = 0; + fid->start_clu = CLUSTER_32(~0); + fid->flags = (p_fs->vol_type == EXFAT) ? 0x03 : 0x01; + +#ifdef CONFIG_EXFAT_DELAYED_SYNC + fs_sync(sb, 0); + fs_set_vol_flags(sb, VOL_CLEAN); +#endif + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return FFS_SUCCESS; +} /* end of ffsRemoveFile */ + +/* ffsSetAttr : set the attribute of a given file */ +s32 ffsSetAttr(struct inode *inode, u32 attr) +{ + u32 type, sector = 0; + DENTRY_T *ep; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + FILE_ID_T *fid = &(EXFAT_I(inode)->fid); + u8 is_dir = (fid->type == TYPE_DIR) ? 1 : 0; + ENTRY_SET_CACHE_T *es = NULL; + + if (fid->attr == attr) { + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + return FFS_SUCCESS; + } + + if (is_dir) { + if ((fid->dir.dir == p_fs->root_dir) && + (fid->entry == -1)) { + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + return FFS_SUCCESS; + } + } + + /* get the directory entry of given file */ + if (p_fs->vol_type == EXFAT) { + es = get_entry_set_in_dir(sb, &(fid->dir), fid->entry, ES_ALL_ENTRIES, &ep); + if (es == NULL) + return FFS_MEDIAERR; + } else { + ep = get_entry_in_dir(sb, &(fid->dir), fid->entry, §or); + if (!ep) + return FFS_MEDIAERR; + } + + type = p_fs->fs_func->get_entry_type(ep); + + if (((type == TYPE_FILE) && (attr & ATTR_SUBDIR)) || + ((type == TYPE_DIR) && (!(attr & ATTR_SUBDIR)))) { + s32 err; + if (p_fs->dev_ejected) + err = FFS_MEDIAERR; + else + err = FFS_ERROR; + + if (p_fs->vol_type == EXFAT) + release_entry_set(es); + return err; + } + + fs_set_vol_flags(sb, VOL_DIRTY); + + /* set the file attribute */ + fid->attr = attr; + p_fs->fs_func->set_entry_attr(ep, attr); + + if (p_fs->vol_type != EXFAT) + buf_modify(sb, sector); + else { + update_dir_checksum_with_entry_set(sb, es); + release_entry_set(es); + } + +#ifdef CONFIG_EXFAT_DELAYED_SYNC + fs_sync(sb, 0); + fs_set_vol_flags(sb, VOL_CLEAN); +#endif + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return FFS_SUCCESS; +} /* end of ffsSetAttr */ + +/* ffsGetStat : get the information of a given file */ +s32 ffsGetStat(struct inode *inode, DIR_ENTRY_T *info) +{ + u32 sector = 0; + s32 count; + CHAIN_T dir; + UNI_NAME_T uni_name; + TIMESTAMP_T tm; + DENTRY_T *ep, *ep2; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + FILE_ID_T *fid = &(EXFAT_I(inode)->fid); + ENTRY_SET_CACHE_T *es = NULL; + u8 is_dir = (fid->type == TYPE_DIR) ? 1 : 0; + + DPRINTK("ffsGetStat entered\n"); + + if (is_dir) { + if ((fid->dir.dir == p_fs->root_dir) && + (fid->entry == -1)) { + info->Attr = ATTR_SUBDIR; + memset((char *) &info->CreateTimestamp, 0, sizeof(DATE_TIME_T)); + memset((char *) &info->ModifyTimestamp, 0, sizeof(DATE_TIME_T)); + memset((char *) &info->AccessTimestamp, 0, sizeof(DATE_TIME_T)); + strcpy(info->ShortName, "."); + strcpy(info->Name, "."); + + dir.dir = p_fs->root_dir; + dir.flags = 0x01; + + if (p_fs->root_dir == CLUSTER_32(0)) /* FAT16 root_dir */ + info->Size = p_fs->dentries_in_root << DENTRY_SIZE_BITS; + else + info->Size = count_num_clusters(sb, &dir) << p_fs->cluster_size_bits; + + count = count_dos_name_entries(sb, &dir, TYPE_DIR); + if (count < 0) + return FFS_MEDIAERR; + info->NumSubdirs = count; + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + return FFS_SUCCESS; + } + } + + /* get the directory entry of given file or directory */ + if (p_fs->vol_type == EXFAT) { + es = get_entry_set_in_dir(sb, &(fid->dir), fid->entry, ES_2_ENTRIES, &ep); + if (es == NULL) + return FFS_MEDIAERR; + ep2 = ep+1; + } else { + ep = get_entry_in_dir(sb, &(fid->dir), fid->entry, §or); + if (!ep) + return FFS_MEDIAERR; + ep2 = ep; + buf_lock(sb, sector); + } + + /* set FILE_INFO structure using the acquired DENTRY_T */ + info->Attr = p_fs->fs_func->get_entry_attr(ep); + + p_fs->fs_func->get_entry_time(ep, &tm, TM_CREATE); + info->CreateTimestamp.Year = tm.year; + info->CreateTimestamp.Month = tm.mon; + info->CreateTimestamp.Day = tm.day; + info->CreateTimestamp.Hour = tm.hour; + info->CreateTimestamp.Minute = tm.min; + info->CreateTimestamp.Second = tm.sec; + info->CreateTimestamp.MilliSecond = 0; + + p_fs->fs_func->get_entry_time(ep, &tm, TM_MODIFY); + info->ModifyTimestamp.Year = tm.year; + info->ModifyTimestamp.Month = tm.mon; + info->ModifyTimestamp.Day = tm.day; + info->ModifyTimestamp.Hour = tm.hour; + info->ModifyTimestamp.Minute = tm.min; + info->ModifyTimestamp.Second = tm.sec; + info->ModifyTimestamp.MilliSecond = 0; + + memset((char *) &info->AccessTimestamp, 0, sizeof(DATE_TIME_T)); + + *(uni_name.name) = 0x0; + /* XXX this is very bad for exfat cuz name is already included in es. + API should be revised */ + p_fs->fs_func->get_uni_name_from_ext_entry(sb, &(fid->dir), fid->entry, uni_name.name); + if (*(uni_name.name) == 0x0 && p_fs->vol_type != EXFAT) + get_uni_name_from_dos_entry(sb, (DOS_DENTRY_T *) ep, &uni_name, 0x1); + nls_uniname_to_cstring(sb, info->Name, &uni_name); + + if (p_fs->vol_type == EXFAT) { + info->NumSubdirs = 2; + } else { + buf_unlock(sb, sector); + get_uni_name_from_dos_entry(sb, (DOS_DENTRY_T *) ep, &uni_name, 0x0); + nls_uniname_to_cstring(sb, info->ShortName, &uni_name); + info->NumSubdirs = 0; + } + + info->Size = p_fs->fs_func->get_entry_size(ep2); + + if (p_fs->vol_type == EXFAT) + release_entry_set(es); + + if (is_dir) { + dir.dir = fid->start_clu; + dir.flags = 0x01; + + if (info->Size == 0) + info->Size = (u64) count_num_clusters(sb, &dir) << p_fs->cluster_size_bits; + + count = count_dos_name_entries(sb, &dir, TYPE_DIR); + if (count < 0) + return FFS_MEDIAERR; + info->NumSubdirs += count; + } + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + DPRINTK("ffsGetStat exited successfully\n"); + return FFS_SUCCESS; +} /* end of ffsGetStat */ + +/* ffsSetStat : set the information of a given file */ +s32 ffsSetStat(struct inode *inode, DIR_ENTRY_T *info) +{ + u32 sector = 0; + TIMESTAMP_T tm; + DENTRY_T *ep, *ep2; + ENTRY_SET_CACHE_T *es = NULL; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + FILE_ID_T *fid = &(EXFAT_I(inode)->fid); + u8 is_dir = (fid->type == TYPE_DIR) ? 1 : 0; + + if (is_dir) { + if ((fid->dir.dir == p_fs->root_dir) && + (fid->entry == -1)) { + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + return FFS_SUCCESS; + } + } + + fs_set_vol_flags(sb, VOL_DIRTY); + + /* get the directory entry of given file or directory */ + if (p_fs->vol_type == EXFAT) { + es = get_entry_set_in_dir(sb, &(fid->dir), fid->entry, ES_ALL_ENTRIES, &ep); + if (es == NULL) + return FFS_MEDIAERR; + ep2 = ep+1; + } else { + /* for other than exfat */ + ep = get_entry_in_dir(sb, &(fid->dir), fid->entry, §or); + if (!ep) + return FFS_MEDIAERR; + ep2 = ep; + } + + + p_fs->fs_func->set_entry_attr(ep, info->Attr); + + /* set FILE_INFO structure using the acquired DENTRY_T */ + tm.sec = info->CreateTimestamp.Second; + tm.min = info->CreateTimestamp.Minute; + tm.hour = info->CreateTimestamp.Hour; + tm.day = info->CreateTimestamp.Day; + tm.mon = info->CreateTimestamp.Month; + tm.year = info->CreateTimestamp.Year; + p_fs->fs_func->set_entry_time(ep, &tm, TM_CREATE); + + tm.sec = info->ModifyTimestamp.Second; + tm.min = info->ModifyTimestamp.Minute; + tm.hour = info->ModifyTimestamp.Hour; + tm.day = info->ModifyTimestamp.Day; + tm.mon = info->ModifyTimestamp.Month; + tm.year = info->ModifyTimestamp.Year; + p_fs->fs_func->set_entry_time(ep, &tm, TM_MODIFY); + + + p_fs->fs_func->set_entry_size(ep2, info->Size); + + if (p_fs->vol_type != EXFAT) { + buf_modify(sb, sector); + } else { + update_dir_checksum_with_entry_set(sb, es); + release_entry_set(es); + } + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return FFS_SUCCESS; +} /* end of ffsSetStat */ + +s32 ffsMapCluster(struct inode *inode, s32 clu_offset, u32 *clu) +{ + s32 num_clusters, num_alloced, modified = FALSE; + u32 last_clu, sector = 0; + CHAIN_T new_clu; + DENTRY_T *ep; + ENTRY_SET_CACHE_T *es = NULL; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + FILE_ID_T *fid = &(EXFAT_I(inode)->fid); + + fid->rwoffset = (s64)(clu_offset) << p_fs->cluster_size_bits; + + if (EXFAT_I(inode)->mmu_private == 0) + num_clusters = 0; + else + num_clusters = (s32)((EXFAT_I(inode)->mmu_private-1) >> p_fs->cluster_size_bits) + 1; + + *clu = last_clu = fid->start_clu; + + if (fid->flags == 0x03) { + if ((clu_offset > 0) && (*clu != CLUSTER_32(~0))) { + last_clu += clu_offset - 1; + + if (clu_offset == num_clusters) + *clu = CLUSTER_32(~0); + else + *clu += clu_offset; + } + } else { + /* hint information */ + if ((clu_offset > 0) && (fid->hint_last_off > 0) && + (clu_offset >= fid->hint_last_off)) { + clu_offset -= fid->hint_last_off; + *clu = fid->hint_last_clu; + } + + while ((clu_offset > 0) && (*clu != CLUSTER_32(~0))) { + last_clu = *clu; + if (FAT_read(sb, *clu, clu) == -1) + return FFS_MEDIAERR; + clu_offset--; + } + } + + if (*clu == CLUSTER_32(~0)) { + fs_set_vol_flags(sb, VOL_DIRTY); + + new_clu.dir = (last_clu == CLUSTER_32(~0)) ? CLUSTER_32(~0) : last_clu+1; + new_clu.size = 0; + new_clu.flags = fid->flags; + + /* (1) allocate a cluster */ + num_alloced = p_fs->fs_func->alloc_cluster(sb, 1, &new_clu); + if (num_alloced < 0) + return FFS_MEDIAERR; + else if (num_alloced == 0) + return FFS_FULL; + + /* (2) append to the FAT chain */ + if (last_clu == CLUSTER_32(~0)) { + if (new_clu.flags == 0x01) + fid->flags = 0x01; + fid->start_clu = new_clu.dir; + modified = TRUE; + } else { + if (new_clu.flags != fid->flags) { + exfat_chain_cont_cluster(sb, fid->start_clu, num_clusters); + fid->flags = 0x01; + modified = TRUE; + } + if (new_clu.flags == 0x01) + FAT_write(sb, last_clu, new_clu.dir); + } + + num_clusters += num_alloced; + *clu = new_clu.dir; + + if (p_fs->vol_type == EXFAT) { + es = get_entry_set_in_dir(sb, &(fid->dir), fid->entry, ES_ALL_ENTRIES, &ep); + if (es == NULL) + return FFS_MEDIAERR; + /* get stream entry */ + ep++; + } + + /* (3) update directory entry */ + if (modified) { + if (p_fs->vol_type != EXFAT) { + ep = get_entry_in_dir(sb, &(fid->dir), fid->entry, §or); + if (!ep) + return FFS_MEDIAERR; + } + + if (p_fs->fs_func->get_entry_flag(ep) != fid->flags) + p_fs->fs_func->set_entry_flag(ep, fid->flags); + + if (p_fs->fs_func->get_entry_clu0(ep) != fid->start_clu) + p_fs->fs_func->set_entry_clu0(ep, fid->start_clu); + + if (p_fs->vol_type != EXFAT) + buf_modify(sb, sector); + } + + if (p_fs->vol_type == EXFAT) { + update_dir_checksum_with_entry_set(sb, es); + release_entry_set(es); + } + + /* add number of new blocks to inode */ + inode->i_blocks += num_alloced << (p_fs->cluster_size_bits - 9); + } + + /* hint information */ + fid->hint_last_off = (s32)(fid->rwoffset >> p_fs->cluster_size_bits); + fid->hint_last_clu = *clu; + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return FFS_SUCCESS; +} /* end of ffsMapCluster */ + +/*----------------------------------------------------------------------*/ +/* Directory Operation Functions */ +/*----------------------------------------------------------------------*/ + +/* ffsCreateDir : create(make) a directory */ +s32 ffsCreateDir(struct inode *inode, char *path, FILE_ID_T *fid) +{ + s32 ret/*, dentry*/; + CHAIN_T dir; + UNI_NAME_T uni_name; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + DPRINTK("ffsCreateDir entered\n"); + + /* check the validity of directory name in the given old pathname */ + ret = resolve_path(inode, path, &dir, &uni_name); + if (ret) + return ret; + + fs_set_vol_flags(sb, VOL_DIRTY); + + ret = create_dir(inode, &dir, &uni_name, fid); + +#ifdef CONFIG_EXFAT_DELAYED_SYNC + fs_sync(sb, 0); + fs_set_vol_flags(sb, VOL_CLEAN); +#endif + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return ret; +} /* end of ffsCreateDir */ + +/* ffsReadDir : read a directory entry from the opened directory */ +s32 ffsReadDir(struct inode *inode, DIR_ENTRY_T *dir_entry) +{ + int i, dentry, clu_offset; + s32 dentries_per_clu, dentries_per_clu_bits = 0; + u32 type, sector; + CHAIN_T dir, clu; + UNI_NAME_T uni_name; + TIMESTAMP_T tm; + DENTRY_T *ep; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + FILE_ID_T *fid = &(EXFAT_I(inode)->fid); + + /* check if the given file ID is opened */ + if (fid->type != TYPE_DIR) + return FFS_PERMISSIONERR; + + if (fid->entry == -1) { + dir.dir = p_fs->root_dir; + dir.flags = 0x01; + } else { + dir.dir = fid->start_clu; + dir.size = (s32)(fid->size >> p_fs->cluster_size_bits); + dir.flags = fid->flags; + } + + dentry = (s32) fid->rwoffset; + + if (dir.dir == CLUSTER_32(0)) { /* FAT16 root_dir */ + dentries_per_clu = p_fs->dentries_in_root; + + if (dentry == dentries_per_clu) { + clu.dir = CLUSTER_32(~0); + } else { + clu.dir = dir.dir; + clu.size = dir.size; + clu.flags = dir.flags; + } + } else { + dentries_per_clu = p_fs->dentries_per_clu; + dentries_per_clu_bits = ilog2(dentries_per_clu); + + clu_offset = dentry >> dentries_per_clu_bits; + clu.dir = dir.dir; + clu.size = dir.size; + clu.flags = dir.flags; + + if (clu.flags == 0x03) { + clu.dir += clu_offset; + clu.size -= clu_offset; + } else { + /* hint_information */ + if ((clu_offset > 0) && (fid->hint_last_off > 0) && + (clu_offset >= fid->hint_last_off)) { + clu_offset -= fid->hint_last_off; + clu.dir = fid->hint_last_clu; + } + + while (clu_offset > 0) { + /* clu.dir = FAT_read(sb, clu.dir); */ + if (FAT_read(sb, clu.dir, &(clu.dir)) == -1) + return FFS_MEDIAERR; + + clu_offset--; + } + } + } + + while (clu.dir != CLUSTER_32(~0)) { + if (p_fs->dev_ejected) + break; + + if (dir.dir == CLUSTER_32(0)) /* FAT16 root_dir */ + i = dentry % dentries_per_clu; + else + i = dentry & (dentries_per_clu-1); + + for ( ; i < dentries_per_clu; i++, dentry++) { + ep = get_entry_in_dir(sb, &clu, i, §or); + if (!ep) + return FFS_MEDIAERR; + + type = p_fs->fs_func->get_entry_type(ep); + + if (type == TYPE_UNUSED) + break; + + if ((type != TYPE_FILE) && (type != TYPE_DIR)) + continue; + + buf_lock(sb, sector); + dir_entry->Attr = p_fs->fs_func->get_entry_attr(ep); + + p_fs->fs_func->get_entry_time(ep, &tm, TM_CREATE); + dir_entry->CreateTimestamp.Year = tm.year; + dir_entry->CreateTimestamp.Month = tm.mon; + dir_entry->CreateTimestamp.Day = tm.day; + dir_entry->CreateTimestamp.Hour = tm.hour; + dir_entry->CreateTimestamp.Minute = tm.min; + dir_entry->CreateTimestamp.Second = tm.sec; + dir_entry->CreateTimestamp.MilliSecond = 0; + + p_fs->fs_func->get_entry_time(ep, &tm, TM_MODIFY); + dir_entry->ModifyTimestamp.Year = tm.year; + dir_entry->ModifyTimestamp.Month = tm.mon; + dir_entry->ModifyTimestamp.Day = tm.day; + dir_entry->ModifyTimestamp.Hour = tm.hour; + dir_entry->ModifyTimestamp.Minute = tm.min; + dir_entry->ModifyTimestamp.Second = tm.sec; + dir_entry->ModifyTimestamp.MilliSecond = 0; + + memset((char *) &dir_entry->AccessTimestamp, 0, sizeof(DATE_TIME_T)); + + *(uni_name.name) = 0x0; + p_fs->fs_func->get_uni_name_from_ext_entry(sb, &dir, dentry, uni_name.name); + if (*(uni_name.name) == 0x0 && p_fs->vol_type != EXFAT) + get_uni_name_from_dos_entry(sb, (DOS_DENTRY_T *) ep, &uni_name, 0x1); + nls_uniname_to_cstring(sb, dir_entry->Name, &uni_name); + buf_unlock(sb, sector); + + if (p_fs->vol_type == EXFAT) { + ep = get_entry_in_dir(sb, &clu, i+1, NULL); + if (!ep) + return FFS_MEDIAERR; + } else { + get_uni_name_from_dos_entry(sb, (DOS_DENTRY_T *) ep, &uni_name, 0x0); + nls_uniname_to_cstring(sb, dir_entry->ShortName, &uni_name); + } + + dir_entry->Size = p_fs->fs_func->get_entry_size(ep); + + /* hint information */ + if (dir.dir == CLUSTER_32(0)) { /* FAT16 root_dir */ + } else { + fid->hint_last_off = dentry >> dentries_per_clu_bits; + fid->hint_last_clu = clu.dir; + } + + fid->rwoffset = (s64) ++dentry; + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return FFS_SUCCESS; + } + + if (dir.dir == CLUSTER_32(0)) + break; /* FAT16 root_dir */ + + if (clu.flags == 0x03) { + if ((--clu.size) > 0) + clu.dir++; + else + clu.dir = CLUSTER_32(~0); + } else { + /* clu.dir = FAT_read(sb, clu.dir); */ + if (FAT_read(sb, clu.dir, &(clu.dir)) == -1) + return FFS_MEDIAERR; + } + } + + *(dir_entry->Name) = '\0'; + + fid->rwoffset = (s64) ++dentry; + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return FFS_SUCCESS; +} /* end of ffsReadDir */ + +/* ffsRemoveDir : remove a directory */ +s32 ffsRemoveDir(struct inode *inode, FILE_ID_T *fid) +{ + s32 dentry; + CHAIN_T dir, clu_to_free; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + dir.dir = fid->dir.dir; + dir.size = fid->dir.size; + dir.flags = fid->dir.flags; + + dentry = fid->entry; + + /* check if the file is "." or ".." */ + if (p_fs->vol_type != EXFAT) { + if ((dir.dir != p_fs->root_dir) && (dentry < 2)) + return FFS_PERMISSIONERR; + } + + clu_to_free.dir = fid->start_clu; + clu_to_free.size = (s32)((fid->size-1) >> p_fs->cluster_size_bits) + 1; + clu_to_free.flags = fid->flags; + + if (!is_dir_empty(sb, &clu_to_free)) + return FFS_FILEEXIST; + + fs_set_vol_flags(sb, VOL_DIRTY); + + /* (1) update the directory entry */ + remove_file(inode, &dir, dentry); + + /* (2) free the clusters */ + p_fs->fs_func->free_cluster(sb, &clu_to_free, 1); + + fid->size = 0; + fid->start_clu = CLUSTER_32(~0); + fid->flags = (p_fs->vol_type == EXFAT)? 0x03: 0x01; + +#ifdef CONFIG_EXFAT_DELAYED_SYNC + fs_sync(sb, 0); + fs_set_vol_flags(sb, VOL_CLEAN); +#endif + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + return FFS_SUCCESS; +} /* end of ffsRemoveDir */ + +/*======================================================================*/ +/* Local Function Definitions */ +/*======================================================================*/ + +/* + * File System Management Functions + */ + +s32 fs_init(void) +{ + /* critical check for system requirement on size of DENTRY_T structure */ + if (sizeof(DENTRY_T) != DENTRY_SIZE) + return FFS_ALIGNMENTERR; + + if (sizeof(DOS_DENTRY_T) != DENTRY_SIZE) + return FFS_ALIGNMENTERR; + + if (sizeof(EXT_DENTRY_T) != DENTRY_SIZE) + return FFS_ALIGNMENTERR; + + if (sizeof(FILE_DENTRY_T) != DENTRY_SIZE) + return FFS_ALIGNMENTERR; + + if (sizeof(STRM_DENTRY_T) != DENTRY_SIZE) + return FFS_ALIGNMENTERR; + + if (sizeof(NAME_DENTRY_T) != DENTRY_SIZE) + return FFS_ALIGNMENTERR; + + if (sizeof(BMAP_DENTRY_T) != DENTRY_SIZE) + return FFS_ALIGNMENTERR; + + if (sizeof(CASE_DENTRY_T) != DENTRY_SIZE) + return FFS_ALIGNMENTERR; + + if (sizeof(VOLM_DENTRY_T) != DENTRY_SIZE) + return FFS_ALIGNMENTERR; + + return FFS_SUCCESS; +} /* end of fs_init */ + +s32 fs_shutdown(void) +{ + return FFS_SUCCESS; +} /* end of fs_shutdown */ + +void fs_set_vol_flags(struct super_block *sb, u32 new_flag) +{ + PBR_SECTOR_T *p_pbr; + BPBEX_T *p_bpb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + if (p_fs->vol_flag == new_flag) + return; + + p_fs->vol_flag = new_flag; + + if (p_fs->vol_type == EXFAT) { + if (p_fs->pbr_bh == NULL) { + if (sector_read(sb, p_fs->PBR_sector, &(p_fs->pbr_bh), 1) != FFS_SUCCESS) + return; + } + + p_pbr = (PBR_SECTOR_T *) p_fs->pbr_bh->b_data; + p_bpb = (BPBEX_T *) p_pbr->bpb; + SET16(p_bpb->vol_flags, (u16) new_flag); + + /* XXX duyoung + what can we do here? (cuz fs_set_vol_flags() is void) */ + if ((new_flag == VOL_DIRTY) && (!buffer_dirty(p_fs->pbr_bh))) + sector_write(sb, p_fs->PBR_sector, p_fs->pbr_bh, 1); + else + sector_write(sb, p_fs->PBR_sector, p_fs->pbr_bh, 0); + } +} /* end of fs_set_vol_flags */ + +void fs_sync(struct super_block *sb, s32 do_sync) +{ + if (do_sync) + bdev_sync(sb); +} /* end of fs_sync */ + +void fs_error(struct super_block *sb) +{ + struct exfat_mount_options *opts = &EXFAT_SB(sb)->options; + + if (opts->errors == EXFAT_ERRORS_PANIC) + panic("[EXFAT] Filesystem panic from previous error\n"); + else if ((opts->errors == EXFAT_ERRORS_RO) && !(sb->s_flags & MS_RDONLY)) { + sb->s_flags |= MS_RDONLY; + printk(KERN_ERR "[EXFAT] Filesystem has been set read-only\n"); + } +} + +/* + * Cluster Management Functions + */ + +s32 clear_cluster(struct super_block *sb, u32 clu) +{ + u32 s, n; + s32 ret = FFS_SUCCESS; + struct buffer_head *tmp_bh = NULL; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + if (clu == CLUSTER_32(0)) { /* FAT16 root_dir */ + s = p_fs->root_start_sector; + n = p_fs->data_start_sector; + } else { + s = START_SECTOR(clu); + n = s + p_fs->sectors_per_clu; + } + + for (; s < n; s++) { + ret = sector_read(sb, s, &tmp_bh, 0); + if (ret != FFS_SUCCESS) + return ret; + + memset((char *) tmp_bh->b_data, 0x0, p_bd->sector_size); + ret = sector_write(sb, s, tmp_bh, 0); + if (ret != FFS_SUCCESS) + break; + } + + brelse(tmp_bh); + return ret; +} /* end of clear_cluster */ + +s32 fat_alloc_cluster(struct super_block *sb, s32 num_alloc, CHAIN_T *p_chain) +{ + int i, num_clusters = 0; + u32 new_clu, last_clu = CLUSTER_32(~0), read_clu; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + new_clu = p_chain->dir; + if (new_clu == CLUSTER_32(~0)) + new_clu = p_fs->clu_srch_ptr; + else if (new_clu >= p_fs->num_clusters) + new_clu = 2; + + __set_sb_dirty(sb); + + p_chain->dir = CLUSTER_32(~0); + + for (i = 2; i < p_fs->num_clusters; i++) { + if (FAT_read(sb, new_clu, &read_clu) != 0) + return -1; + + if (read_clu == CLUSTER_32(0)) { + if (FAT_write(sb, new_clu, CLUSTER_32(~0)) < 0) + return -1; + num_clusters++; + + if (p_chain->dir == CLUSTER_32(~0)) + p_chain->dir = new_clu; + else { + if (FAT_write(sb, last_clu, new_clu) < 0) + return -1; + } + + last_clu = new_clu; + + if ((--num_alloc) == 0) { + p_fs->clu_srch_ptr = new_clu; + if (p_fs->used_clusters != (u32) ~0) + p_fs->used_clusters += num_clusters; + + return num_clusters; + } + } + if ((++new_clu) >= p_fs->num_clusters) + new_clu = 2; + } + + p_fs->clu_srch_ptr = new_clu; + if (p_fs->used_clusters != (u32) ~0) + p_fs->used_clusters += num_clusters; + + return num_clusters; +} /* end of fat_alloc_cluster */ + +s32 exfat_alloc_cluster(struct super_block *sb, s32 num_alloc, CHAIN_T *p_chain) +{ + s32 num_clusters = 0; + u32 hint_clu, new_clu, last_clu = CLUSTER_32(~0); + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + hint_clu = p_chain->dir; + if (hint_clu == CLUSTER_32(~0)) { + hint_clu = test_alloc_bitmap(sb, p_fs->clu_srch_ptr-2); + if (hint_clu == CLUSTER_32(~0)) + return 0; + } else if (hint_clu >= p_fs->num_clusters) { + hint_clu = 2; + p_chain->flags = 0x01; + } + + __set_sb_dirty(sb); + + p_chain->dir = CLUSTER_32(~0); + + while ((new_clu = test_alloc_bitmap(sb, hint_clu-2)) != CLUSTER_32(~0)) { + if (new_clu != hint_clu) { + if (p_chain->flags == 0x03) { + exfat_chain_cont_cluster(sb, p_chain->dir, num_clusters); + p_chain->flags = 0x01; + } + } + + if (set_alloc_bitmap(sb, new_clu-2) != FFS_SUCCESS) + return -1; + + num_clusters++; + + if (p_chain->flags == 0x01) { + if (FAT_write(sb, new_clu, CLUSTER_32(~0)) < 0) + return -1; + } + + if (p_chain->dir == CLUSTER_32(~0)) { + p_chain->dir = new_clu; + } else { + if (p_chain->flags == 0x01) { + if (FAT_write(sb, last_clu, new_clu) < 0) + return -1; + } + } + last_clu = new_clu; + + if ((--num_alloc) == 0) { + p_fs->clu_srch_ptr = hint_clu; + if (p_fs->used_clusters != (u32) ~0) + p_fs->used_clusters += num_clusters; + + p_chain->size += num_clusters; + return num_clusters; + } + + hint_clu = new_clu + 1; + if (hint_clu >= p_fs->num_clusters) { + hint_clu = 2; + + if (p_chain->flags == 0x03) { + exfat_chain_cont_cluster(sb, p_chain->dir, num_clusters); + p_chain->flags = 0x01; + } + } + } + + p_fs->clu_srch_ptr = hint_clu; + if (p_fs->used_clusters != (u32) ~0) + p_fs->used_clusters += num_clusters; + + p_chain->size += num_clusters; + return num_clusters; +} /* end of exfat_alloc_cluster */ + +void fat_free_cluster(struct super_block *sb, CHAIN_T *p_chain, s32 do_relse) +{ + s32 num_clusters = 0; + u32 clu, prev; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + int i; + u32 sector; + + if ((p_chain->dir == CLUSTER_32(0)) || (p_chain->dir == CLUSTER_32(~0))) + return; + __set_sb_dirty(sb); + clu = p_chain->dir; + + if (p_chain->size <= 0) + return; + + do { + if (p_fs->dev_ejected) + break; + + if (do_relse) { + sector = START_SECTOR(clu); + for (i = 0; i < p_fs->sectors_per_clu; i++) + buf_release(sb, sector+i); + } + + prev = clu; + if (FAT_read(sb, clu, &clu) == -1) + break; + + if (FAT_write(sb, prev, CLUSTER_32(0)) < 0) + break; + num_clusters++; + + } while (clu != CLUSTER_32(~0)); + + if (p_fs->used_clusters != (u32) ~0) + p_fs->used_clusters -= num_clusters; +} /* end of fat_free_cluster */ + +void exfat_free_cluster(struct super_block *sb, CHAIN_T *p_chain, s32 do_relse) +{ + s32 num_clusters = 0; + u32 clu; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + int i; + u32 sector; + + if ((p_chain->dir == CLUSTER_32(0)) || (p_chain->dir == CLUSTER_32(~0))) + return; + + if (p_chain->size <= 0) { + printk(KERN_ERR "[EXFAT] free_cluster : skip free-req clu:%u, " + "because of zero-size truncation\n" + ,p_chain->dir); + return; + } + + __set_sb_dirty(sb); + clu = p_chain->dir; + + if (p_chain->flags == 0x03) { + do { + if (do_relse) { + sector = START_SECTOR(clu); + for (i = 0; i < p_fs->sectors_per_clu; i++) + buf_release(sb, sector+i); + } + + if (clr_alloc_bitmap(sb, clu-2) != FFS_SUCCESS) + break; + clu++; + + num_clusters++; + } while (num_clusters < p_chain->size); + } else { + do { + if (p_fs->dev_ejected) + break; + + if (do_relse) { + sector = START_SECTOR(clu); + for (i = 0; i < p_fs->sectors_per_clu; i++) + buf_release(sb, sector+i); + } + + if (clr_alloc_bitmap(sb, clu-2) != FFS_SUCCESS) + break; + + if (FAT_read(sb, clu, &clu) == -1) + break; + num_clusters++; + } while ((clu != CLUSTER_32(0)) && (clu != CLUSTER_32(~0))); + } + + if (p_fs->used_clusters != (u32) ~0) + p_fs->used_clusters -= num_clusters; +} /* end of exfat_free_cluster */ + +u32 find_last_cluster(struct super_block *sb, CHAIN_T *p_chain) +{ + u32 clu, next; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + clu = p_chain->dir; + + if (p_chain->flags == 0x03) { + clu += p_chain->size - 1; + } else { + while ((FAT_read(sb, clu, &next) == 0) && (next != CLUSTER_32(~0))) { + if (p_fs->dev_ejected) + break; + clu = next; + } + } + + return clu; +} /* end of find_last_cluster */ + +s32 count_num_clusters(struct super_block *sb, CHAIN_T *p_chain) +{ + int i, count = 0; + u32 clu; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + if ((p_chain->dir == CLUSTER_32(0)) || (p_chain->dir == CLUSTER_32(~0))) + return 0; + + clu = p_chain->dir; + + if (p_chain->flags == 0x03) { + count = p_chain->size; + } else { + for (i = 2; i < p_fs->num_clusters; i++) { + count++; + if (FAT_read(sb, clu, &clu) != 0) + return 0; + if (clu == CLUSTER_32(~0)) + break; + } + } + + return count; +} /* end of count_num_clusters */ + +s32 fat_count_used_clusters(struct super_block *sb) +{ + int i, count = 0; + u32 clu; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + for (i = 2; i < p_fs->num_clusters; i++) { + if (FAT_read(sb, i, &clu) != 0) + break; + if (clu != CLUSTER_32(0)) + count++; + } + + return count; +} /* end of fat_count_used_clusters */ + +s32 exfat_count_used_clusters(struct super_block *sb) +{ + int i, map_i, map_b, count = 0; + u8 k; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + map_i = map_b = 0; + + for (i = 2; i < p_fs->num_clusters; i += 8) { + k = *(((u8 *) p_fs->vol_amap[map_i]->b_data) + map_b); + count += used_bit[k]; + + if ((++map_b) >= p_bd->sector_size) { + map_i++; + map_b = 0; + } + } + + return count; +} /* end of exfat_count_used_clusters */ + +void exfat_chain_cont_cluster(struct super_block *sb, u32 chain, s32 len) +{ + if (len == 0) + return; + + while (len > 1) { + if (FAT_write(sb, chain, chain+1) < 0) + break; + chain++; + len--; + } + FAT_write(sb, chain, CLUSTER_32(~0)); +} /* end of exfat_chain_cont_cluster */ + +/* + * Allocation Bitmap Management Functions + */ + +s32 load_alloc_bitmap(struct super_block *sb) +{ + int i, j, ret; + u32 map_size; + u32 type, sector; + CHAIN_T clu; + BMAP_DENTRY_T *ep; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + clu.dir = p_fs->root_dir; + clu.flags = 0x01; + + while (clu.dir != CLUSTER_32(~0)) { + if (p_fs->dev_ejected) + break; + + for (i = 0; i < p_fs->dentries_per_clu; i++) { + ep = (BMAP_DENTRY_T *) get_entry_in_dir(sb, &clu, i, NULL); + if (!ep) + return FFS_MEDIAERR; + + type = p_fs->fs_func->get_entry_type((DENTRY_T *) ep); + + if (type == TYPE_UNUSED) + break; + if (type != TYPE_BITMAP) + continue; + + if (ep->flags == 0x0) { + p_fs->map_clu = GET32_A(ep->start_clu); + map_size = (u32) GET64_A(ep->size); + + p_fs->map_sectors = ((map_size-1) >> p_bd->sector_size_bits) + 1; + + p_fs->vol_amap = (struct buffer_head **) kmalloc(sizeof(struct buffer_head *) * p_fs->map_sectors, GFP_KERNEL); + if (p_fs->vol_amap == NULL) + return FFS_MEMORYERR; + + sector = START_SECTOR(p_fs->map_clu); + + for (j = 0; j < p_fs->map_sectors; j++) { + p_fs->vol_amap[j] = NULL; + ret = sector_read(sb, sector+j, &(p_fs->vol_amap[j]), 1); + if (ret != FFS_SUCCESS) { + /* release all buffers and free vol_amap */ + i = 0; + while (i < j) + brelse(p_fs->vol_amap[i++]); + + if (p_fs->vol_amap) + kfree(p_fs->vol_amap); + p_fs->vol_amap = NULL; + return ret; + } + } + + p_fs->pbr_bh = NULL; + return FFS_SUCCESS; + } + } + + if (FAT_read(sb, clu.dir, &(clu.dir)) != 0) + return FFS_MEDIAERR; + } + + return FFS_FORMATERR; +} /* end of load_alloc_bitmap */ + +void free_alloc_bitmap(struct super_block *sb) +{ + int i; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + brelse(p_fs->pbr_bh); + + for (i = 0; i < p_fs->map_sectors; i++) + __brelse(p_fs->vol_amap[i]); + + if (p_fs->vol_amap) + kfree(p_fs->vol_amap); + p_fs->vol_amap = NULL; +} /* end of free_alloc_bitmap */ + +s32 set_alloc_bitmap(struct super_block *sb, u32 clu) +{ + int i, b; + u32 sector; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + i = clu >> (p_bd->sector_size_bits + 3); + b = clu & ((p_bd->sector_size << 3) - 1); + + sector = START_SECTOR(p_fs->map_clu) + i; + + exfat_bitmap_set((u8 *) p_fs->vol_amap[i]->b_data, b); + + return sector_write(sb, sector, p_fs->vol_amap[i], 0); +} /* end of set_alloc_bitmap */ + +s32 clr_alloc_bitmap(struct super_block *sb, u32 clu) +{ + int i, b; + u32 sector; +#ifdef CONFIG_EXFAT_DISCARD + struct exfat_sb_info *sbi = EXFAT_SB(sb); + struct exfat_mount_options *opts = &sbi->options; + int ret; +#endif /* CONFIG_EXFAT_DISCARD */ + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + i = clu >> (p_bd->sector_size_bits + 3); + b = clu & ((p_bd->sector_size << 3) - 1); + + sector = START_SECTOR(p_fs->map_clu) + i; + + exfat_bitmap_clear((u8 *) p_fs->vol_amap[i]->b_data, b); + + return sector_write(sb, sector, p_fs->vol_amap[i], 0); + +#ifdef CONFIG_EXFAT_DISCARD + if (opts->discard) { +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,37) + ret = sb_issue_discard(sb, START_SECTOR(clu), (1 << p_fs->sectors_per_clu_bits)); +#else + ret = sb_issue_discard(sb, START_SECTOR(clu), (1 << p_fs->sectors_per_clu_bits), GFP_NOFS, 0); +#endif + if (ret == -EOPNOTSUPP) { + printk(KERN_WARNING "discard not supported by device, disabling"); + opts->discard = 0; + } + } +#endif /* CONFIG_EXFAT_DISCARD */ +} /* end of clr_alloc_bitmap */ + +u32 test_alloc_bitmap(struct super_block *sb, u32 clu) +{ + int i, map_i, map_b; + u32 clu_base, clu_free; + u8 k, clu_mask; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + clu_base = (clu & ~(0x7)) + 2; + clu_mask = (1 << (clu - clu_base + 2)) - 1; + + map_i = clu >> (p_bd->sector_size_bits + 3); + map_b = (clu >> 3) & p_bd->sector_size_mask; + + for (i = 2; i < p_fs->num_clusters; i += 8) { + k = *(((u8 *) p_fs->vol_amap[map_i]->b_data) + map_b); + if (clu_mask > 0) { + k |= clu_mask; + clu_mask = 0; + } + if (k < 0xFF) { + clu_free = clu_base + free_bit[k]; + if (clu_free < p_fs->num_clusters) + return clu_free; + } + clu_base += 8; + + if (((++map_b) >= p_bd->sector_size) || (clu_base >= p_fs->num_clusters)) { + if ((++map_i) >= p_fs->map_sectors) { + clu_base = 2; + map_i = 0; + } + map_b = 0; + } + } + + return CLUSTER_32(~0); +} /* end of test_alloc_bitmap */ + +void sync_alloc_bitmap(struct super_block *sb) +{ + int i; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + if (p_fs->vol_amap == NULL) + return; + + for (i = 0; i < p_fs->map_sectors; i++) + sync_dirty_buffer(p_fs->vol_amap[i]); +} /* end of sync_alloc_bitmap */ + +/* + * Upcase table Management Functions + */ +s32 __load_upcase_table(struct super_block *sb, u32 sector, u32 num_sectors, u32 utbl_checksum) +{ + int i, ret = FFS_ERROR; + u32 j; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + struct buffer_head *tmp_bh = NULL; + + u8 skip = FALSE; + u32 index = 0; + u16 uni = 0; + u16 **upcase_table; + + u32 checksum = 0; + + upcase_table = p_fs->vol_utbl = (u16 **) kmalloc(UTBL_COL_COUNT * sizeof(u16 *), GFP_KERNEL); + if (upcase_table == NULL) + return FFS_MEMORYERR; + memset(upcase_table, 0, UTBL_COL_COUNT * sizeof(u16 *)); + + num_sectors += sector; + + while (sector < num_sectors) { + ret = sector_read(sb, sector, &tmp_bh, 1); + if (ret != FFS_SUCCESS) { + DPRINTK("sector read (0x%X)fail\n", sector); + goto error; + } + sector++; + + for (i = 0; i < p_bd->sector_size && index <= 0xFFFF; i += 2) { + uni = GET16(((u8 *) tmp_bh->b_data)+i); + + checksum = ((checksum & 1) ? 0x80000000 : 0) + (checksum >> 1) + *(((u8 *) tmp_bh->b_data)+i); + checksum = ((checksum & 1) ? 0x80000000 : 0) + (checksum >> 1) + *(((u8 *) tmp_bh->b_data)+(i+1)); + + if (skip) { + DPRINTK("skip from 0x%X ", index); + index += uni; + DPRINTK("to 0x%X (amount of 0x%X)\n", index, uni); + skip = FALSE; + } else if (uni == index) + index++; + else if (uni == 0xFFFF) + skip = TRUE; + else { /* uni != index , uni != 0xFFFF */ + u16 col_index = get_col_index(index); + + if (upcase_table[col_index] == NULL) { + DPRINTK("alloc = 0x%X\n", col_index); + upcase_table[col_index] = (u16 *) kmalloc(UTBL_ROW_COUNT * sizeof(u16), GFP_KERNEL); + if (upcase_table[col_index] == NULL) { + ret = FFS_MEMORYERR; + goto error; + } + + for (j = 0; j < UTBL_ROW_COUNT; j++) + upcase_table[col_index][j] = (col_index << LOW_INDEX_BIT) | j; + } + + upcase_table[col_index][get_row_index(index)] = uni; + index++; + } + } + } + if (index >= 0xFFFF && utbl_checksum == checksum) { + if (tmp_bh) + brelse(tmp_bh); + return FFS_SUCCESS; + } + ret = FFS_ERROR; +error: + if (tmp_bh) + brelse(tmp_bh); + free_upcase_table(sb); + return ret; +} + +s32 __load_default_upcase_table(struct super_block *sb) +{ + int i, ret = FFS_ERROR; + u32 j; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + u8 skip = FALSE; + u32 index = 0; + u16 uni = 0; + u16 **upcase_table; + + upcase_table = p_fs->vol_utbl = (u16 **) kmalloc(UTBL_COL_COUNT * sizeof(u16 *), GFP_KERNEL); + if (upcase_table == NULL) + return FFS_MEMORYERR; + memset(upcase_table, 0, UTBL_COL_COUNT * sizeof(u16 *)); + + for (i = 0; index <= 0xFFFF && i < NUM_UPCASE*2; i += 2) { + uni = GET16(uni_upcase + i); + if (skip) { + DPRINTK("skip from 0x%X ", index); + index += uni; + DPRINTK("to 0x%X (amount of 0x%X)\n", index, uni); + skip = FALSE; + } else if (uni == index) + index++; + else if (uni == 0xFFFF) + skip = TRUE; + else { /* uni != index , uni != 0xFFFF */ + u16 col_index = get_col_index(index); + + if (upcase_table[col_index] == NULL) { + DPRINTK("alloc = 0x%X\n", col_index); + upcase_table[col_index] = (u16 *) kmalloc(UTBL_ROW_COUNT * sizeof(u16), GFP_KERNEL); + if (upcase_table[col_index] == NULL) { + ret = FFS_MEMORYERR; + goto error; + } + + for (j = 0; j < UTBL_ROW_COUNT; j++) + upcase_table[col_index][j] = (col_index << LOW_INDEX_BIT) | j; + } + + upcase_table[col_index][get_row_index(index)] = uni; + index++; + } + } + + if (index >= 0xFFFF) + return FFS_SUCCESS; + +error: + /* FATAL error: default upcase table has error */ + free_upcase_table(sb); + return ret; +} + +s32 load_upcase_table(struct super_block *sb) +{ + int i; + u32 tbl_clu, tbl_size; + u32 type, sector, num_sectors; + CHAIN_T clu; + CASE_DENTRY_T *ep; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + clu.dir = p_fs->root_dir; + clu.flags = 0x01; + + if (p_fs->dev_ejected) + return FFS_MEDIAERR; + + while (clu.dir != CLUSTER_32(~0)) { + for (i = 0; i < p_fs->dentries_per_clu; i++) { + ep = (CASE_DENTRY_T *) get_entry_in_dir(sb, &clu, i, NULL); + if (!ep) + return FFS_MEDIAERR; + + type = p_fs->fs_func->get_entry_type((DENTRY_T *) ep); + + if (type == TYPE_UNUSED) + break; + if (type != TYPE_UPCASE) + continue; + + tbl_clu = GET32_A(ep->start_clu); + tbl_size = (u32) GET64_A(ep->size); + + sector = START_SECTOR(tbl_clu); + num_sectors = ((tbl_size-1) >> p_bd->sector_size_bits) + 1; + if (__load_upcase_table(sb, sector, num_sectors, GET32_A(ep->checksum)) != FFS_SUCCESS) + break; + else + return FFS_SUCCESS; + } + if (FAT_read(sb, clu.dir, &(clu.dir)) != 0) + return FFS_MEDIAERR; + } + /* load default upcase table */ + return __load_default_upcase_table(sb); +} /* end of load_upcase_table */ + +void free_upcase_table(struct super_block *sb) +{ + u32 i; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + u16 **upcase_table; + + upcase_table = p_fs->vol_utbl; + for (i = 0; i < UTBL_COL_COUNT; i++) { + if (upcase_table[i]) + kfree(upcase_table[i]); + } + + if (p_fs->vol_utbl) + kfree(p_fs->vol_utbl); + p_fs->vol_utbl = NULL; +} /* end of free_upcase_table */ + +/* + * Directory Entry Management Functions + */ + +u32 fat_get_entry_type(DENTRY_T *p_entry) +{ + DOS_DENTRY_T *ep = (DOS_DENTRY_T *) p_entry; + + if (*(ep->name) == 0x0) + return TYPE_UNUSED; + + else if (*(ep->name) == 0xE5) + return TYPE_DELETED; + + else if (ep->attr == ATTR_EXTEND) + return TYPE_EXTEND; + + else if ((ep->attr & (ATTR_SUBDIR|ATTR_VOLUME)) == ATTR_VOLUME) + return TYPE_VOLUME; + + else if ((ep->attr & (ATTR_SUBDIR|ATTR_VOLUME)) == ATTR_SUBDIR) + return TYPE_DIR; + + return TYPE_FILE; +} /* end of fat_get_entry_type */ + +u32 exfat_get_entry_type(DENTRY_T *p_entry) +{ + FILE_DENTRY_T *ep = (FILE_DENTRY_T *) p_entry; + + if (ep->type == 0x0) { + return TYPE_UNUSED; + } else if (ep->type < 0x80) { + return TYPE_DELETED; + } else if (ep->type == 0x80) { + return TYPE_INVALID; + } else if (ep->type < 0xA0) { + if (ep->type == 0x81) { + return TYPE_BITMAP; + } else if (ep->type == 0x82) { + return TYPE_UPCASE; + } else if (ep->type == 0x83) { + return TYPE_VOLUME; + } else if (ep->type == 0x85) { + if (GET16_A(ep->attr) & ATTR_SUBDIR) + return TYPE_DIR; + else + return TYPE_FILE; + } + return TYPE_CRITICAL_PRI; + } else if (ep->type < 0xC0) { + if (ep->type == 0xA0) + return TYPE_GUID; + else if (ep->type == 0xA1) + return TYPE_PADDING; + else if (ep->type == 0xA2) + return TYPE_ACLTAB; + return TYPE_BENIGN_PRI; + } else if (ep->type < 0xE0) { + if (ep->type == 0xC0) + return TYPE_STREAM; + else if (ep->type == 0xC1) + return TYPE_EXTEND; + else if (ep->type == 0xC2) + return TYPE_ACL; + return TYPE_CRITICAL_SEC; + } + + return TYPE_BENIGN_SEC; +} /* end of exfat_get_entry_type */ + +void fat_set_entry_type(DENTRY_T *p_entry, u32 type) +{ + DOS_DENTRY_T *ep = (DOS_DENTRY_T *) p_entry; + + if (type == TYPE_UNUSED) + *(ep->name) = 0x0; + + else if (type == TYPE_DELETED) + *(ep->name) = 0xE5; + + else if (type == TYPE_EXTEND) + ep->attr = ATTR_EXTEND; + + else if (type == TYPE_DIR) + ep->attr = ATTR_SUBDIR; + + else if (type == TYPE_FILE) + ep->attr = ATTR_ARCHIVE; + + else if (type == TYPE_SYMLINK) + ep->attr = ATTR_ARCHIVE | ATTR_SYMLINK; +} /* end of fat_set_entry_type */ + +void exfat_set_entry_type(DENTRY_T *p_entry, u32 type) +{ + FILE_DENTRY_T *ep = (FILE_DENTRY_T *) p_entry; + + if (type == TYPE_UNUSED) { + ep->type = 0x0; + } else if (type == TYPE_DELETED) { + ep->type &= ~0x80; + } else if (type == TYPE_STREAM) { + ep->type = 0xC0; + } else if (type == TYPE_EXTEND) { + ep->type = 0xC1; + } else if (type == TYPE_BITMAP) { + ep->type = 0x81; + } else if (type == TYPE_UPCASE) { + ep->type = 0x82; + } else if (type == TYPE_VOLUME) { + ep->type = 0x83; + } else if (type == TYPE_DIR) { + ep->type = 0x85; + SET16_A(ep->attr, ATTR_SUBDIR); + } else if (type == TYPE_FILE) { + ep->type = 0x85; + SET16_A(ep->attr, ATTR_ARCHIVE); + } else if (type == TYPE_SYMLINK) { + ep->type = 0x85; + SET16_A(ep->attr, ATTR_ARCHIVE | ATTR_SYMLINK); + } +} /* end of exfat_set_entry_type */ + +u32 fat_get_entry_attr(DENTRY_T *p_entry) +{ + DOS_DENTRY_T *ep = (DOS_DENTRY_T *) p_entry; + return (u32) ep->attr; +} /* end of fat_get_entry_attr */ + +u32 exfat_get_entry_attr(DENTRY_T *p_entry) +{ + FILE_DENTRY_T *ep = (FILE_DENTRY_T *) p_entry; + return (u32) GET16_A(ep->attr); +} /* end of exfat_get_entry_attr */ + +void fat_set_entry_attr(DENTRY_T *p_entry, u32 attr) +{ + DOS_DENTRY_T *ep = (DOS_DENTRY_T *) p_entry; + ep->attr = (u8) attr; +} /* end of fat_set_entry_attr */ + +void exfat_set_entry_attr(DENTRY_T *p_entry, u32 attr) +{ + FILE_DENTRY_T *ep = (FILE_DENTRY_T *) p_entry; + SET16_A(ep->attr, (u16) attr); +} /* end of exfat_set_entry_attr */ + +u8 fat_get_entry_flag(DENTRY_T *p_entry) +{ + return 0x01; +} /* end of fat_get_entry_flag */ + +u8 exfat_get_entry_flag(DENTRY_T *p_entry) +{ + STRM_DENTRY_T *ep = (STRM_DENTRY_T *) p_entry; + return ep->flags; +} /* end of exfat_get_entry_flag */ + +void fat_set_entry_flag(DENTRY_T *p_entry, u8 flags) +{ +} /* end of fat_set_entry_flag */ + +void exfat_set_entry_flag(DENTRY_T *p_entry, u8 flags) +{ + STRM_DENTRY_T *ep = (STRM_DENTRY_T *) p_entry; + ep->flags = flags; +} /* end of exfat_set_entry_flag */ + +u32 fat_get_entry_clu0(DENTRY_T *p_entry) +{ + DOS_DENTRY_T *ep = (DOS_DENTRY_T *) p_entry; + return ((u32) GET16_A(ep->start_clu_hi) << 16) | GET16_A(ep->start_clu_lo); +} /* end of fat_get_entry_clu0 */ + +u32 exfat_get_entry_clu0(DENTRY_T *p_entry) +{ + STRM_DENTRY_T *ep = (STRM_DENTRY_T *) p_entry; + return GET32_A(ep->start_clu); +} /* end of exfat_get_entry_clu0 */ + +void fat_set_entry_clu0(DENTRY_T *p_entry, u32 start_clu) +{ + DOS_DENTRY_T *ep = (DOS_DENTRY_T *) p_entry; + SET16_A(ep->start_clu_lo, CLUSTER_16(start_clu)); + SET16_A(ep->start_clu_hi, CLUSTER_16(start_clu >> 16)); +} /* end of fat_set_entry_clu0 */ + +void exfat_set_entry_clu0(DENTRY_T *p_entry, u32 start_clu) +{ + STRM_DENTRY_T *ep = (STRM_DENTRY_T *) p_entry; + SET32_A(ep->start_clu, start_clu); +} /* end of exfat_set_entry_clu0 */ + +u64 fat_get_entry_size(DENTRY_T *p_entry) +{ + DOS_DENTRY_T *ep = (DOS_DENTRY_T *) p_entry; + return (u64) GET32_A(ep->size); +} /* end of fat_get_entry_size */ + +u64 exfat_get_entry_size(DENTRY_T *p_entry) +{ + STRM_DENTRY_T *ep = (STRM_DENTRY_T *) p_entry; + return GET64_A(ep->valid_size); +} /* end of exfat_get_entry_size */ + +void fat_set_entry_size(DENTRY_T *p_entry, u64 size) +{ + DOS_DENTRY_T *ep = (DOS_DENTRY_T *) p_entry; + SET32_A(ep->size, (u32) size); +} /* end of fat_set_entry_size */ + +void exfat_set_entry_size(DENTRY_T *p_entry, u64 size) +{ + STRM_DENTRY_T *ep = (STRM_DENTRY_T *) p_entry; + SET64_A(ep->valid_size, size); + SET64_A(ep->size, size); +} /* end of exfat_set_entry_size */ + +void fat_get_entry_time(DENTRY_T *p_entry, TIMESTAMP_T *tp, u8 mode) +{ + u16 t = 0x00, d = 0x21; + DOS_DENTRY_T *ep = (DOS_DENTRY_T *) p_entry; + + switch (mode) { + case TM_CREATE: + t = GET16_A(ep->create_time); + d = GET16_A(ep->create_date); + break; + case TM_MODIFY: + t = GET16_A(ep->modify_time); + d = GET16_A(ep->modify_date); + break; + } + + tp->sec = (t & 0x001F) << 1; + tp->min = (t >> 5) & 0x003F; + tp->hour = (t >> 11); + tp->day = (d & 0x001F); + tp->mon = (d >> 5) & 0x000F; + tp->year = (d >> 9); +} /* end of fat_get_entry_time */ + +void exfat_get_entry_time(DENTRY_T *p_entry, TIMESTAMP_T *tp, u8 mode) +{ + u16 t = 0x00, d = 0x21; + FILE_DENTRY_T *ep = (FILE_DENTRY_T *) p_entry; + + switch (mode) { + case TM_CREATE: + t = GET16_A(ep->create_time); + d = GET16_A(ep->create_date); + break; + case TM_MODIFY: + t = GET16_A(ep->modify_time); + d = GET16_A(ep->modify_date); + break; + case TM_ACCESS: + t = GET16_A(ep->access_time); + d = GET16_A(ep->access_date); + break; + } + + tp->sec = (t & 0x001F) << 1; + tp->min = (t >> 5) & 0x003F; + tp->hour = (t >> 11); + tp->day = (d & 0x001F); + tp->mon = (d >> 5) & 0x000F; + tp->year = (d >> 9); +} /* end of exfat_get_entry_time */ + +void fat_set_entry_time(DENTRY_T *p_entry, TIMESTAMP_T *tp, u8 mode) +{ + u16 t, d; + DOS_DENTRY_T *ep = (DOS_DENTRY_T *) p_entry; + + t = (tp->hour << 11) | (tp->min << 5) | (tp->sec >> 1); + d = (tp->year << 9) | (tp->mon << 5) | tp->day; + + switch (mode) { + case TM_CREATE: + SET16_A(ep->create_time, t); + SET16_A(ep->create_date, d); + break; + case TM_MODIFY: + SET16_A(ep->modify_time, t); + SET16_A(ep->modify_date, d); + break; + } +} /* end of fat_set_entry_time */ + +void exfat_set_entry_time(DENTRY_T *p_entry, TIMESTAMP_T *tp, u8 mode) +{ + u16 t, d; + FILE_DENTRY_T *ep = (FILE_DENTRY_T *) p_entry; + + t = (tp->hour << 11) | (tp->min << 5) | (tp->sec >> 1); + d = (tp->year << 9) | (tp->mon << 5) | tp->day; + + switch (mode) { + case TM_CREATE: + SET16_A(ep->create_time, t); + SET16_A(ep->create_date, d); + break; + case TM_MODIFY: + SET16_A(ep->modify_time, t); + SET16_A(ep->modify_date, d); + break; + case TM_ACCESS: + SET16_A(ep->access_time, t); + SET16_A(ep->access_date, d); + break; + } +} /* end of exfat_set_entry_time */ + +s32 fat_init_dir_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u32 type, + u32 start_clu, u64 size) +{ + u32 sector; + DOS_DENTRY_T *dos_ep; + + dos_ep = (DOS_DENTRY_T *) get_entry_in_dir(sb, p_dir, entry, §or); + if (!dos_ep) + return FFS_MEDIAERR; + + init_dos_entry(dos_ep, type, start_clu); + buf_modify(sb, sector); + + return FFS_SUCCESS; +} /* end of fat_init_dir_entry */ + +s32 exfat_init_dir_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u32 type, + u32 start_clu, u64 size) +{ + u32 sector; + u8 flags; + FILE_DENTRY_T *file_ep; + STRM_DENTRY_T *strm_ep; + + flags = (type == TYPE_FILE) ? 0x01 : 0x03; + + /* we cannot use get_entry_set_in_dir here because file ep is not initialized yet */ + file_ep = (FILE_DENTRY_T *) get_entry_in_dir(sb, p_dir, entry, §or); + if (!file_ep) + return FFS_MEDIAERR; + + strm_ep = (STRM_DENTRY_T *) get_entry_in_dir(sb, p_dir, entry+1, §or); + if (!strm_ep) + return FFS_MEDIAERR; + + init_file_entry(file_ep, type); + buf_modify(sb, sector); + + init_strm_entry(strm_ep, flags, start_clu, size); + buf_modify(sb, sector); + + return FFS_SUCCESS; +} /* end of exfat_init_dir_entry */ + +s32 fat_init_ext_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, s32 num_entries, + UNI_NAME_T *p_uniname, DOS_NAME_T *p_dosname) +{ + int i; + u32 sector; + u8 chksum; + u16 *uniname = p_uniname->name; + DOS_DENTRY_T *dos_ep; + EXT_DENTRY_T *ext_ep; + + dos_ep = (DOS_DENTRY_T *) get_entry_in_dir(sb, p_dir, entry, §or); + if (!dos_ep) + return FFS_MEDIAERR; + + dos_ep->lcase = p_dosname->name_case; + memcpy(dos_ep->name, p_dosname->name, DOS_NAME_LENGTH); + buf_modify(sb, sector); + + if ((--num_entries) > 0) { + chksum = calc_checksum_1byte((void *) dos_ep->name, DOS_NAME_LENGTH, 0); + + for (i = 1; i < num_entries; i++) { + ext_ep = (EXT_DENTRY_T *) get_entry_in_dir(sb, p_dir, entry-i, §or); + if (!ext_ep) + return FFS_MEDIAERR; + + init_ext_entry(ext_ep, i, chksum, uniname); + buf_modify(sb, sector); + uniname += 13; + } + + ext_ep = (EXT_DENTRY_T *) get_entry_in_dir(sb, p_dir, entry-i, §or); + if (!ext_ep) + return FFS_MEDIAERR; + + init_ext_entry(ext_ep, i+0x40, chksum, uniname); + buf_modify(sb, sector); + } + + return FFS_SUCCESS; +} /* end of fat_init_ext_entry */ + +s32 exfat_init_ext_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, s32 num_entries, + UNI_NAME_T *p_uniname, DOS_NAME_T *p_dosname) +{ + int i; + u32 sector; + u16 *uniname = p_uniname->name; + FILE_DENTRY_T *file_ep; + STRM_DENTRY_T *strm_ep; + NAME_DENTRY_T *name_ep; + + file_ep = (FILE_DENTRY_T *) get_entry_in_dir(sb, p_dir, entry, §or); + if (!file_ep) + return FFS_MEDIAERR; + + file_ep->num_ext = (u8)(num_entries - 1); + buf_modify(sb, sector); + + strm_ep = (STRM_DENTRY_T *) get_entry_in_dir(sb, p_dir, entry+1, §or); + if (!strm_ep) + return FFS_MEDIAERR; + + strm_ep->name_len = p_uniname->name_len; + SET16_A(strm_ep->name_hash, p_uniname->name_hash); + buf_modify(sb, sector); + + for (i = 2; i < num_entries; i++) { + name_ep = (NAME_DENTRY_T *) get_entry_in_dir(sb, p_dir, entry+i, §or); + if (!name_ep) + return FFS_MEDIAERR; + + init_name_entry(name_ep, uniname); + buf_modify(sb, sector); + uniname += 15; + } + + update_dir_checksum(sb, p_dir, entry); + + return FFS_SUCCESS; +} /* end of exfat_init_ext_entry */ + +void init_dos_entry(DOS_DENTRY_T *ep, u32 type, u32 start_clu) +{ + TIMESTAMP_T tm, *tp; + + fat_set_entry_type((DENTRY_T *) ep, type); + SET16_A(ep->start_clu_lo, CLUSTER_16(start_clu)); + SET16_A(ep->start_clu_hi, CLUSTER_16(start_clu >> 16)); + SET32_A(ep->size, 0); + + tp = tm_current(&tm); + fat_set_entry_time((DENTRY_T *) ep, tp, TM_CREATE); + fat_set_entry_time((DENTRY_T *) ep, tp, TM_MODIFY); + SET16_A(ep->access_date, 0); + ep->create_time_ms = 0; +} /* end of init_dos_entry */ + +void init_ext_entry(EXT_DENTRY_T *ep, s32 order, u8 chksum, u16 *uniname) +{ + int i; + u8 end = FALSE; + + fat_set_entry_type((DENTRY_T *) ep, TYPE_EXTEND); + ep->order = (u8) order; + ep->sysid = 0; + ep->checksum = chksum; + SET16_A(ep->start_clu, 0); + + for (i = 0; i < 10; i += 2) { + if (!end) { + SET16(ep->unicode_0_4+i, *uniname); + if (*uniname == 0x0) + end = TRUE; + else + uniname++; + } else { + SET16(ep->unicode_0_4+i, 0xFFFF); + } + } + + for (i = 0; i < 12; i += 2) { + if (!end) { + SET16_A(ep->unicode_5_10+i, *uniname); + if (*uniname == 0x0) + end = TRUE; + else + uniname++; + } else { + SET16_A(ep->unicode_5_10+i, 0xFFFF); + } + } + + for (i = 0; i < 4; i += 2) { + if (!end) { + SET16_A(ep->unicode_11_12+i, *uniname); + if (*uniname == 0x0) + end = TRUE; + else + uniname++; + } else { + SET16_A(ep->unicode_11_12+i, 0xFFFF); + } + } +} /* end of init_ext_entry */ + +void init_file_entry(FILE_DENTRY_T *ep, u32 type) +{ + TIMESTAMP_T tm, *tp; + + exfat_set_entry_type((DENTRY_T *) ep, type); + + tp = tm_current(&tm); + exfat_set_entry_time((DENTRY_T *) ep, tp, TM_CREATE); + exfat_set_entry_time((DENTRY_T *) ep, tp, TM_MODIFY); + exfat_set_entry_time((DENTRY_T *) ep, tp, TM_ACCESS); + ep->create_time_ms = 0; + ep->modify_time_ms = 0; + ep->access_time_ms = 0; +} /* end of init_file_entry */ + +void init_strm_entry(STRM_DENTRY_T *ep, u8 flags, u32 start_clu, u64 size) +{ + exfat_set_entry_type((DENTRY_T *) ep, TYPE_STREAM); + ep->flags = flags; + SET32_A(ep->start_clu, start_clu); + SET64_A(ep->valid_size, size); + SET64_A(ep->size, size); +} /* end of init_strm_entry */ + +void init_name_entry(NAME_DENTRY_T *ep, u16 *uniname) +{ + int i; + + exfat_set_entry_type((DENTRY_T *) ep, TYPE_EXTEND); + ep->flags = 0x0; + + for (i = 0; i < 30; i++, i++) { + SET16_A(ep->unicode_0_14+i, *uniname); + if (*uniname == 0x0) + break; + uniname++; + } +} /* end of init_name_entry */ + +void fat_delete_dir_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, s32 order, s32 num_entries) +{ + int i; + u32 sector; + DENTRY_T *ep; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + for (i = num_entries-1; i >= order; i--) { + ep = get_entry_in_dir(sb, p_dir, entry-i, §or); + if (!ep) + return; + + p_fs->fs_func->set_entry_type(ep, TYPE_DELETED); + buf_modify(sb, sector); + } +} /* end of fat_delete_dir_entry */ + +void exfat_delete_dir_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, s32 order, s32 num_entries) +{ + int i; + u32 sector; + DENTRY_T *ep; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + for (i = order; i < num_entries; i++) { + ep = get_entry_in_dir(sb, p_dir, entry+i, §or); + if (!ep) + return; + + p_fs->fs_func->set_entry_type(ep, TYPE_DELETED); + buf_modify(sb, sector); + } +} /* end of exfat_delete_dir_entry */ + +void update_dir_checksum(struct super_block *sb, CHAIN_T *p_dir, s32 entry) +{ + int i, num_entries; + u32 sector; + u16 chksum; + FILE_DENTRY_T *file_ep; + DENTRY_T *ep; + + file_ep = (FILE_DENTRY_T *) get_entry_in_dir(sb, p_dir, entry, §or); + if (!file_ep) + return; + + buf_lock(sb, sector); + + num_entries = (s32) file_ep->num_ext + 1; + chksum = calc_checksum_2byte((void *) file_ep, DENTRY_SIZE, 0, CS_DIR_ENTRY); + + for (i = 1; i < num_entries; i++) { + ep = get_entry_in_dir(sb, p_dir, entry+i, NULL); + if (!ep) { + buf_unlock(sb, sector); + return; + } + + chksum = calc_checksum_2byte((void *) ep, DENTRY_SIZE, chksum, CS_DEFAULT); + } + + SET16_A(file_ep->checksum, chksum); + buf_modify(sb, sector); + buf_unlock(sb, sector); +} /* end of update_dir_checksum */ + +void update_dir_checksum_with_entry_set(struct super_block *sb, ENTRY_SET_CACHE_T *es) +{ + DENTRY_T *ep; + u16 chksum = 0; + s32 chksum_type = CS_DIR_ENTRY, i; + + ep = (DENTRY_T *)&(es->__buf); + for (i = 0; i < es->num_entries; i++) { + DPRINTK("update_dir_checksum_with_entry_set ep %p\n", ep); + chksum = calc_checksum_2byte((void *) ep, DENTRY_SIZE, chksum, chksum_type); + ep++; + chksum_type = CS_DEFAULT; + } + + ep = (DENTRY_T *)&(es->__buf); + SET16_A(((FILE_DENTRY_T *)ep)->checksum, chksum); + write_whole_entry_set(sb, es); +} + +static s32 _walk_fat_chain(struct super_block *sb, CHAIN_T *p_dir, s32 byte_offset, u32 *clu) +{ + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + s32 clu_offset; + u32 cur_clu; + + clu_offset = byte_offset >> p_fs->cluster_size_bits; + cur_clu = p_dir->dir; + + if (p_dir->flags == 0x03) { + cur_clu += clu_offset; + } else { + while (clu_offset > 0) { + if (FAT_read(sb, cur_clu, &cur_clu) == -1) + return FFS_MEDIAERR; + clu_offset--; + } + } + + if (clu) + *clu = cur_clu; + return FFS_SUCCESS; +} +s32 find_location(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u32 *sector, s32 *offset) +{ + s32 off, ret; + u32 clu = 0; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + off = entry << DENTRY_SIZE_BITS; + + if (p_dir->dir == CLUSTER_32(0)) { /* FAT16 root_dir */ + *offset = off & p_bd->sector_size_mask; + *sector = off >> p_bd->sector_size_bits; + *sector += p_fs->root_start_sector; + } else { + ret = _walk_fat_chain(sb, p_dir, off, &clu); + if (ret != FFS_SUCCESS) + return ret; + + off &= p_fs->cluster_size - 1; /* byte offset in cluster */ + + *offset = off & p_bd->sector_size_mask; /* byte offset in sector */ + *sector = off >> p_bd->sector_size_bits; /* sector offset in cluster */ + *sector += START_SECTOR(clu); + } + return FFS_SUCCESS; +} /* end of find_location */ + +DENTRY_T *get_entry_with_sector(struct super_block *sb, u32 sector, s32 offset) +{ + u8 *buf; + + buf = buf_getblk(sb, sector); + + if (buf == NULL) + return NULL; + + return (DENTRY_T *)(buf + offset); +} /* end of get_entry_with_sector */ + +DENTRY_T *get_entry_in_dir(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u32 *sector) +{ + s32 off; + u32 sec; + u8 *buf; + + if (find_location(sb, p_dir, entry, &sec, &off) != FFS_SUCCESS) + return NULL; + + buf = buf_getblk(sb, sec); + + if (buf == NULL) + return NULL; + + if (sector != NULL) + *sector = sec; + return (DENTRY_T *)(buf + off); +} /* end of get_entry_in_dir */ + + +/* returns a set of dentries for a file or dir. + * Note that this is a copy (dump) of dentries so that user should call write_entry_set() + * to apply changes made in this entry set to the real device. + * in: + * sb+p_dir+entry: indicates a file/dir + * type: specifies how many dentries should be included. + * out: + * file_ep: will point the first dentry(= file dentry) on success + * return: + * pointer of entry set on success, + * NULL on failure. + */ + +#define ES_MODE_STARTED 0 +#define ES_MODE_GET_FILE_ENTRY 1 +#define ES_MODE_GET_STRM_ENTRY 2 +#define ES_MODE_GET_NAME_ENTRY 3 +#define ES_MODE_GET_CRITICAL_SEC_ENTRY 4 +ENTRY_SET_CACHE_T *get_entry_set_in_dir(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u32 type, DENTRY_T **file_ep) +{ + s32 off, ret, byte_offset; + u32 clu = 0; + u32 sec, entry_type; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + ENTRY_SET_CACHE_T *es = NULL; + DENTRY_T *ep, *pos; + u8 *buf; + u8 num_entries; + s32 mode = ES_MODE_STARTED; + + DPRINTK("get_entry_set_in_dir entered\n"); + DPRINTK("p_dir dir %u flags %x size %d\n", p_dir->dir, p_dir->flags, p_dir->size); + + byte_offset = entry << DENTRY_SIZE_BITS; + ret = _walk_fat_chain(sb, p_dir, byte_offset, &clu); + if (ret != FFS_SUCCESS) + return NULL; + + + byte_offset &= p_fs->cluster_size - 1; /* byte offset in cluster */ + + off = byte_offset & p_bd->sector_size_mask; /* byte offset in sector */ + sec = byte_offset >> p_bd->sector_size_bits; /* sector offset in cluster */ + sec += START_SECTOR(clu); + + buf = buf_getblk(sb, sec); + if (buf == NULL) + goto err_out; + + + ep = (DENTRY_T *)(buf + off); + entry_type = p_fs->fs_func->get_entry_type(ep); + + if ((entry_type != TYPE_FILE) + && (entry_type != TYPE_DIR)) + goto err_out; + + if (type == ES_ALL_ENTRIES) + num_entries = ((FILE_DENTRY_T *)ep)->num_ext+1; + else + num_entries = type; + + DPRINTK("trying to kmalloc %zx bytes for %d entries\n", offsetof(ENTRY_SET_CACHE_T, __buf) + (num_entries) * sizeof(DENTRY_T), num_entries); + es = kmalloc(offsetof(ENTRY_SET_CACHE_T, __buf) + (num_entries) * sizeof(DENTRY_T), GFP_KERNEL); + if (es == NULL) + goto err_out; + + es->num_entries = num_entries; + es->sector = sec; + es->offset = off; + es->alloc_flag = p_dir->flags; + + pos = (DENTRY_T *) &(es->__buf); + + while(num_entries) { + /* instead of copying whole sector, we will check every entry. + * this will provide minimum stablity and consistancy. + */ + + entry_type = p_fs->fs_func->get_entry_type(ep); + + if ((entry_type == TYPE_UNUSED) || (entry_type == TYPE_DELETED)) + goto err_out; + + switch (mode) { + case ES_MODE_STARTED: + if ((entry_type == TYPE_FILE) || (entry_type == TYPE_DIR)) + mode = ES_MODE_GET_FILE_ENTRY; + else + goto err_out; + break; + case ES_MODE_GET_FILE_ENTRY: + if (entry_type == TYPE_STREAM) + mode = ES_MODE_GET_STRM_ENTRY; + else + goto err_out; + break; + case ES_MODE_GET_STRM_ENTRY: + if (entry_type == TYPE_EXTEND) + mode = ES_MODE_GET_NAME_ENTRY; + else + goto err_out; + break; + case ES_MODE_GET_NAME_ENTRY: + if (entry_type == TYPE_EXTEND) + break; + else if (entry_type == TYPE_STREAM) + goto err_out; + else if (entry_type & TYPE_CRITICAL_SEC) + mode = ES_MODE_GET_CRITICAL_SEC_ENTRY; + else + goto err_out; + break; + case ES_MODE_GET_CRITICAL_SEC_ENTRY: + if ((entry_type == TYPE_EXTEND) || (entry_type == TYPE_STREAM)) + goto err_out; + else if ((entry_type & TYPE_CRITICAL_SEC) != TYPE_CRITICAL_SEC) + goto err_out; + break; + } + + memcpy(pos, ep, sizeof(DENTRY_T)); + + if (--num_entries == 0) + break; + + if (((off + DENTRY_SIZE) & p_bd->sector_size_mask) < (off & p_bd->sector_size_mask)) { + /* get the next sector */ + if (IS_LAST_SECTOR_IN_CLUSTER(sec)) { + if (es->alloc_flag == 0x03) { + clu++; + } else { + if (FAT_read(sb, clu, &clu) == -1) + goto err_out; + } + sec = START_SECTOR(clu); + } else { + sec++; + } + buf = buf_getblk(sb, sec); + if (buf == NULL) + goto err_out; + off = 0; + ep = (DENTRY_T *)(buf); + } else { + ep++; + off += DENTRY_SIZE; + } + pos++; + } + + if (file_ep) + *file_ep = (DENTRY_T *)&(es->__buf); + + DPRINTK("es sec %u offset %d flags %d, num_entries %u buf ptr %p\n", + es->sector, es->offset, es->alloc_flag, es->num_entries, &(es->__buf)); + DPRINTK("get_entry_set_in_dir exited %p\n", es); + return es; +err_out: + DPRINTK("get_entry_set_in_dir exited NULL (es %p)\n", es); + if (es) + kfree(es); + return NULL; +} + +void release_entry_set(ENTRY_SET_CACHE_T *es) +{ + DPRINTK("release_entry_set %p\n", es); + if (es) + kfree(es); +} + + +static s32 __write_partial_entries_in_entry_set(struct super_block *sb, ENTRY_SET_CACHE_T *es, u32 sec, s32 off, u32 count) +{ + s32 num_entries, buf_off = (off - es->offset); + u32 remaining_byte_in_sector, copy_entries; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + u32 clu; + u8 *buf, *esbuf = (u8 *)&(es->__buf); + + DPRINTK("__write_partial_entries_in_entry_set entered\n"); + DPRINTK("es %p sec %u off %d count %d\n", es, sec, off, count); + num_entries = count; + + while (num_entries) { + /* white per sector base */ + remaining_byte_in_sector = (1 << p_bd->sector_size_bits) - off; + copy_entries = MIN(remaining_byte_in_sector >> DENTRY_SIZE_BITS , num_entries); + buf = buf_getblk(sb, sec); + if (buf == NULL) + goto err_out; + DPRINTK("es->buf %p buf_off %u\n", esbuf, buf_off); + DPRINTK("copying %d entries from %p to sector %u\n", copy_entries, (esbuf + buf_off), sec); + memcpy(buf + off, esbuf + buf_off, copy_entries << DENTRY_SIZE_BITS); + buf_modify(sb, sec); + num_entries -= copy_entries; + + if (num_entries) { + /* get next sector */ + if (IS_LAST_SECTOR_IN_CLUSTER(sec)) { + clu = GET_CLUSTER_FROM_SECTOR(sec); + if (es->alloc_flag == 0x03) { + clu++; + } else { + if (FAT_read(sb, clu, &clu) == -1) + goto err_out; + } + sec = START_SECTOR(clu); + } else { + sec++; + } + off = 0; + buf_off += copy_entries << DENTRY_SIZE_BITS; + } + } + + DPRINTK("__write_partial_entries_in_entry_set exited successfully\n"); + return FFS_SUCCESS; +err_out: + DPRINTK("__write_partial_entries_in_entry_set failed\n"); + return FFS_ERROR; +} + +/* write back all entries in entry set */ +s32 write_whole_entry_set(struct super_block *sb, ENTRY_SET_CACHE_T *es) +{ + return __write_partial_entries_in_entry_set(sb, es, es->sector, es->offset, es->num_entries); +} + +/* write back some entries in entry set */ +s32 write_partial_entries_in_entry_set (struct super_block *sb, ENTRY_SET_CACHE_T *es, DENTRY_T *ep, u32 count) +{ + s32 ret, byte_offset, off; + u32 clu=0, sec; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + CHAIN_T dir; + + /* vaidity check */ + if (ep + count > ((DENTRY_T *)&(es->__buf)) + es->num_entries) + return FFS_ERROR; + + dir.dir = GET_CLUSTER_FROM_SECTOR(es->sector); + dir.flags = es->alloc_flag; + dir.size = 0xffffffff; /* XXX */ + + byte_offset = (es->sector - START_SECTOR(dir.dir)) << p_bd->sector_size_bits; + byte_offset += ((void **)ep - &(es->__buf)) + es->offset; + + ret =_walk_fat_chain(sb, &dir, byte_offset, &clu); + if (ret != FFS_SUCCESS) + return ret; + byte_offset &= p_fs->cluster_size - 1; /* byte offset in cluster */ + off = byte_offset & p_bd->sector_size_mask; /* byte offset in sector */ + sec = byte_offset >> p_bd->sector_size_bits; /* sector offset in cluster */ + sec += START_SECTOR(clu); + return __write_partial_entries_in_entry_set(sb, es, sec, off, count); +} + +/* search EMPTY CONTINUOUS "num_entries" entries */ +s32 search_deleted_or_unused_entry(struct super_block *sb, CHAIN_T *p_dir, s32 num_entries) +{ + int i, dentry, num_empty = 0; + s32 dentries_per_clu; + u32 type; + CHAIN_T clu; + DENTRY_T *ep; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + if (p_dir->dir == CLUSTER_32(0)) /* FAT16 root_dir */ + dentries_per_clu = p_fs->dentries_in_root; + else + dentries_per_clu = p_fs->dentries_per_clu; + + if (p_fs->hint_uentry.dir == p_dir->dir) { + if (p_fs->hint_uentry.entry == -1) + return -1; + + clu.dir = p_fs->hint_uentry.clu.dir; + clu.size = p_fs->hint_uentry.clu.size; + clu.flags = p_fs->hint_uentry.clu.flags; + + dentry = p_fs->hint_uentry.entry; + } else { + p_fs->hint_uentry.entry = -1; + + clu.dir = p_dir->dir; + clu.size = p_dir->size; + clu.flags = p_dir->flags; + + dentry = 0; + } + + while (clu.dir != CLUSTER_32(~0)) { + if (p_fs->dev_ejected) + break; + + if (p_dir->dir == CLUSTER_32(0)) /* FAT16 root_dir */ + i = dentry % dentries_per_clu; + else + i = dentry & (dentries_per_clu-1); + + for (; i < dentries_per_clu; i++, dentry++) { + ep = get_entry_in_dir(sb, &clu, i, NULL); + if (!ep) + return -1; + + type = p_fs->fs_func->get_entry_type(ep); + + if (type == TYPE_UNUSED) { + num_empty++; + if (p_fs->hint_uentry.entry == -1) { + p_fs->hint_uentry.dir = p_dir->dir; + p_fs->hint_uentry.entry = dentry; + + p_fs->hint_uentry.clu.dir = clu.dir; + p_fs->hint_uentry.clu.size = clu.size; + p_fs->hint_uentry.clu.flags = clu.flags; + } + } else if (type == TYPE_DELETED) { + num_empty++; + } else { + num_empty = 0; + } + + if (num_empty >= num_entries) { + p_fs->hint_uentry.dir = CLUSTER_32(~0); + p_fs->hint_uentry.entry = -1; + + if (p_fs->vol_type == EXFAT) + return dentry - (num_entries-1); + else + return dentry; + } + } + + if (p_dir->dir == CLUSTER_32(0)) + break; /* FAT16 root_dir */ + + if (clu.flags == 0x03) { + if ((--clu.size) > 0) + clu.dir++; + else + clu.dir = CLUSTER_32(~0); + } else { + if (FAT_read(sb, clu.dir, &(clu.dir)) != 0) + return -1; + } + } + + return -1; +} /* end of search_deleted_or_unused_entry */ + +s32 find_empty_entry(struct inode *inode, CHAIN_T *p_dir, s32 num_entries) +{ + s32 ret, dentry; + u32 last_clu, sector; + u64 size = 0; + CHAIN_T clu; + DENTRY_T *ep = NULL; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + FILE_ID_T *fid = &(EXFAT_I(inode)->fid); + + if (p_dir->dir == CLUSTER_32(0)) /* FAT16 root_dir */ + return search_deleted_or_unused_entry(sb, p_dir, num_entries); + + while ((dentry = search_deleted_or_unused_entry(sb, p_dir, num_entries)) < 0) { + if (p_fs->dev_ejected) + break; + + if (p_fs->vol_type == EXFAT) { + if (p_dir->dir != p_fs->root_dir) + size = i_size_read(inode); + } + + last_clu = find_last_cluster(sb, p_dir); + clu.dir = last_clu + 1; + clu.size = 0; + clu.flags = p_dir->flags; + + /* (1) allocate a cluster */ + ret = p_fs->fs_func->alloc_cluster(sb, 1, &clu); + if (ret < 1) + return -1; + + if (clear_cluster(sb, clu.dir) != FFS_SUCCESS) + return -1; + + /* (2) append to the FAT chain */ + if (clu.flags != p_dir->flags) { + exfat_chain_cont_cluster(sb, p_dir->dir, p_dir->size); + p_dir->flags = 0x01; + p_fs->hint_uentry.clu.flags = 0x01; + } + if (clu.flags == 0x01) + if (FAT_write(sb, last_clu, clu.dir) < 0) + return -1; + + if (p_fs->hint_uentry.entry == -1) { + p_fs->hint_uentry.dir = p_dir->dir; + p_fs->hint_uentry.entry = p_dir->size << (p_fs->cluster_size_bits - DENTRY_SIZE_BITS); + + p_fs->hint_uentry.clu.dir = clu.dir; + p_fs->hint_uentry.clu.size = 0; + p_fs->hint_uentry.clu.flags = clu.flags; + } + p_fs->hint_uentry.clu.size++; + p_dir->size++; + + /* (3) update the directory entry */ + if (p_fs->vol_type == EXFAT) { + if (p_dir->dir != p_fs->root_dir) { + size += p_fs->cluster_size; + + ep = get_entry_in_dir(sb, &(fid->dir), fid->entry+1, §or); + if (!ep) + return -1; + p_fs->fs_func->set_entry_size(ep, size); + p_fs->fs_func->set_entry_flag(ep, p_dir->flags); + buf_modify(sb, sector); + + update_dir_checksum(sb, &(fid->dir), fid->entry); + } + } + + i_size_write(inode, i_size_read(inode)+p_fs->cluster_size); + EXFAT_I(inode)->mmu_private += p_fs->cluster_size; + EXFAT_I(inode)->fid.size += p_fs->cluster_size; + EXFAT_I(inode)->fid.flags = p_dir->flags; + inode->i_blocks += 1 << (p_fs->cluster_size_bits - 9); + } + + return dentry; +} /* end of find_empty_entry */ + +/* return values of fat_find_dir_entry() + >= 0 : return dir entiry position with the name in dir + -1 : (root dir, ".") it is the root dir itself + -2 : entry with the name does not exist */ +s32 fat_find_dir_entry(struct super_block *sb, CHAIN_T *p_dir, UNI_NAME_T *p_uniname, s32 num_entries, DOS_NAME_T *p_dosname, u32 type) +{ + int i, dentry = 0, lossy = FALSE, len; + s32 order = 0, is_feasible_entry = TRUE, has_ext_entry = FALSE; + s32 dentries_per_clu; + u32 entry_type; + u16 entry_uniname[14], *uniname = NULL, unichar; + CHAIN_T clu; + DENTRY_T *ep; + DOS_DENTRY_T *dos_ep; + EXT_DENTRY_T *ext_ep; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + if (p_dir->dir == p_fs->root_dir) { + if ((!nls_uniname_cmp(sb, p_uniname->name, (u16 *) UNI_CUR_DIR_NAME)) || + (!nls_uniname_cmp(sb, p_uniname->name, (u16 *) UNI_PAR_DIR_NAME))) + return -1; // special case, root directory itself + } + + if (p_dir->dir == CLUSTER_32(0)) /* FAT16 root_dir */ + dentries_per_clu = p_fs->dentries_in_root; + else + dentries_per_clu = p_fs->dentries_per_clu; + + clu.dir = p_dir->dir; + clu.flags = p_dir->flags; + + while (clu.dir != CLUSTER_32(~0)) { + if (p_fs->dev_ejected) + break; + + for (i = 0; i < dentries_per_clu; i++, dentry++) { + ep = get_entry_in_dir(sb, &clu, i, NULL); + if (!ep) + return -2; + + entry_type = p_fs->fs_func->get_entry_type(ep); + + if ((entry_type == TYPE_FILE) || (entry_type == TYPE_DIR)) { + if ((type == TYPE_ALL) || (type == entry_type)) { + if (is_feasible_entry && has_ext_entry) + return dentry; + + dos_ep = (DOS_DENTRY_T *) ep; + if ((!lossy) && (!nls_dosname_cmp(sb, p_dosname->name, dos_ep->name))) + return dentry; + } + is_feasible_entry = TRUE; + has_ext_entry = FALSE; + } else if (entry_type == TYPE_EXTEND) { + if (is_feasible_entry) { + ext_ep = (EXT_DENTRY_T *) ep; + if (ext_ep->order > 0x40) { + order = (s32)(ext_ep->order - 0x40); + uniname = p_uniname->name + 13 * (order-1); + } else { + order = (s32) ext_ep->order; + uniname -= 13; + } + + len = extract_uni_name_from_ext_entry(ext_ep, entry_uniname, order); + + unichar = *(uniname+len); + *(uniname+len) = 0x0; + + if (nls_uniname_cmp(sb, uniname, entry_uniname)) + is_feasible_entry = FALSE; + + *(uniname+len) = unichar; + } + has_ext_entry = TRUE; + } else if (entry_type == TYPE_UNUSED) { + return -2; + } else { + is_feasible_entry = TRUE; + has_ext_entry = FALSE; + } + } + + if (p_dir->dir == CLUSTER_32(0)) + break; /* FAT16 root_dir */ + + if (FAT_read(sb, clu.dir, &(clu.dir)) != 0) + return -2; + } + + return -2; +} /* end of fat_find_dir_entry */ + +/* return values of exfat_find_dir_entry() + >= 0 : return dir entiry position with the name in dir + -1 : (root dir, ".") it is the root dir itself + -2 : entry with the name does not exist */ +s32 exfat_find_dir_entry(struct super_block *sb, CHAIN_T *p_dir, UNI_NAME_T *p_uniname, s32 num_entries, DOS_NAME_T *p_dosname, u32 type) +{ + int i, dentry = 0, num_ext_entries = 0, len; + s32 order = 0, is_feasible_entry = FALSE; + s32 dentries_per_clu, num_empty = 0; + u32 entry_type; + u16 entry_uniname[16], *uniname = NULL, unichar; + CHAIN_T clu; + DENTRY_T *ep; + FILE_DENTRY_T *file_ep; + STRM_DENTRY_T *strm_ep; + NAME_DENTRY_T *name_ep; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + if (p_dir->dir == p_fs->root_dir) { + if ((!nls_uniname_cmp(sb, p_uniname->name, (u16 *) UNI_CUR_DIR_NAME)) || + (!nls_uniname_cmp(sb, p_uniname->name, (u16 *) UNI_PAR_DIR_NAME))) + return -1; // special case, root directory itself + } + + if (p_dir->dir == CLUSTER_32(0)) /* FAT16 root_dir */ + dentries_per_clu = p_fs->dentries_in_root; + else + dentries_per_clu = p_fs->dentries_per_clu; + + clu.dir = p_dir->dir; + clu.size = p_dir->size; + clu.flags = p_dir->flags; + + p_fs->hint_uentry.dir = p_dir->dir; + p_fs->hint_uentry.entry = -1; + + while (clu.dir != CLUSTER_32(~0)) { + if (p_fs->dev_ejected) + break; + + for (i = 0; i < dentries_per_clu; i++, dentry++) { + ep = get_entry_in_dir(sb, &clu, i, NULL); + if (!ep) + return -2; + + entry_type = p_fs->fs_func->get_entry_type(ep); + + if ((entry_type == TYPE_UNUSED) || (entry_type == TYPE_DELETED)) { + is_feasible_entry = FALSE; + + if (p_fs->hint_uentry.entry == -1) { + num_empty++; + + if (num_empty == 1) { + p_fs->hint_uentry.clu.dir = clu.dir; + p_fs->hint_uentry.clu.size = clu.size; + p_fs->hint_uentry.clu.flags = clu.flags; + } + if ((num_empty >= num_entries) || (entry_type == TYPE_UNUSED)) + p_fs->hint_uentry.entry = dentry - (num_empty-1); + } + + if (entry_type == TYPE_UNUSED) + return -2; + } else { + num_empty = 0; + + if ((entry_type == TYPE_FILE) || (entry_type == TYPE_DIR)) { + if ((type == TYPE_ALL) || (type == entry_type)) { + file_ep = (FILE_DENTRY_T *) ep; + num_ext_entries = file_ep->num_ext; + is_feasible_entry = TRUE; + } else { + is_feasible_entry = FALSE; + } + } else if (entry_type == TYPE_STREAM) { + if (is_feasible_entry) { + strm_ep = (STRM_DENTRY_T *) ep; + if (p_uniname->name_len == strm_ep->name_len) { + order = 1; + } else { + is_feasible_entry = FALSE; + } + } + } else if (entry_type == TYPE_EXTEND) { + if (is_feasible_entry) { + name_ep = (NAME_DENTRY_T *) ep; + + if ((++order) == 2) + uniname = p_uniname->name; + else + uniname += 15; + + len = extract_uni_name_from_name_entry(name_ep, entry_uniname, order); + + unichar = *(uniname+len); + *(uniname+len) = 0x0; + + if (nls_uniname_cmp(sb, uniname, entry_uniname)) { + is_feasible_entry = FALSE; + } else if (order == num_ext_entries) { + p_fs->hint_uentry.dir = CLUSTER_32(~0); + p_fs->hint_uentry.entry = -1; + return dentry - (num_ext_entries); + } + + *(uniname+len) = unichar; + } + } else { + is_feasible_entry = FALSE; + } + } + } + + if (p_dir->dir == CLUSTER_32(0)) + break; /* FAT16 root_dir */ + + if (clu.flags == 0x03) { + if ((--clu.size) > 0) + clu.dir++; + else + clu.dir = CLUSTER_32(~0); + } else { + if (FAT_read(sb, clu.dir, &(clu.dir)) != 0) + return -2; + } + } + + return -2; +} /* end of exfat_find_dir_entry */ + +/* returns -1 on error */ +s32 fat_count_ext_entries(struct super_block *sb, CHAIN_T *p_dir, s32 entry, DENTRY_T *p_entry) +{ + s32 count = 0; + u8 chksum; + DOS_DENTRY_T *dos_ep = (DOS_DENTRY_T *) p_entry; + EXT_DENTRY_T *ext_ep; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + chksum = calc_checksum_1byte((void *) dos_ep->name, DOS_NAME_LENGTH, 0); + + for (entry--; entry >= 0; entry--) { + ext_ep = (EXT_DENTRY_T *) get_entry_in_dir(sb, p_dir, entry, NULL); + if (!ext_ep) + return -1; + + if ((p_fs->fs_func->get_entry_type((DENTRY_T *) ext_ep) == TYPE_EXTEND) && + (ext_ep->checksum == chksum)) { + count++; + if (ext_ep->order > 0x40) + return count; + } else { + return count; + } + } + + return count; +} /* end of fat_count_ext_entries */ + +/* returns -1 on error */ +s32 exfat_count_ext_entries(struct super_block *sb, CHAIN_T *p_dir, s32 entry, DENTRY_T *p_entry) +{ + int i, count = 0; + u32 type; + FILE_DENTRY_T *file_ep = (FILE_DENTRY_T *) p_entry; + DENTRY_T *ext_ep; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + for (i = 0, entry++; i < file_ep->num_ext; i++, entry++) { + ext_ep = get_entry_in_dir(sb, p_dir, entry, NULL); + if (!ext_ep) + return -1; + + type = p_fs->fs_func->get_entry_type(ext_ep); + if ((type == TYPE_EXTEND) || (type == TYPE_STREAM)) + count++; + else + return count; + } + + return count; +} /* end of exfat_count_ext_entries */ + +/* returns -1 on error */ +s32 count_dos_name_entries(struct super_block *sb, CHAIN_T *p_dir, u32 type) +{ + int i, count = 0; + s32 dentries_per_clu; + u32 entry_type; + CHAIN_T clu; + DENTRY_T *ep; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + if (p_dir->dir == CLUSTER_32(0)) /* FAT16 root_dir */ + dentries_per_clu = p_fs->dentries_in_root; + else + dentries_per_clu = p_fs->dentries_per_clu; + + clu.dir = p_dir->dir; + clu.size = p_dir->size; + clu.flags = p_dir->flags; + + while (clu.dir != CLUSTER_32(~0)) { + if (p_fs->dev_ejected) + break; + + for (i = 0; i < dentries_per_clu; i++) { + ep = get_entry_in_dir(sb, &clu, i, NULL); + if (!ep) + return -1; + + entry_type = p_fs->fs_func->get_entry_type(ep); + + if (entry_type == TYPE_UNUSED) + return count; + if (!(type & TYPE_CRITICAL_PRI) && !(type & TYPE_BENIGN_PRI)) + continue; + + if ((type == TYPE_ALL) || (type == entry_type)) + count++; + } + + if (p_dir->dir == CLUSTER_32(0)) + break; /* FAT16 root_dir */ + + if (clu.flags == 0x03) { + if ((--clu.size) > 0) + clu.dir++; + else + clu.dir = CLUSTER_32(~0); + } else { + if (FAT_read(sb, clu.dir, &(clu.dir)) != 0) + return -1; + } + } + + return count; +} /* end of count_dos_name_entries */ + +bool is_dir_empty(struct super_block *sb, CHAIN_T *p_dir) +{ + int i, count = 0; + s32 dentries_per_clu; + u32 type; + CHAIN_T clu; + DENTRY_T *ep; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + if (p_dir->dir == CLUSTER_32(0)) /* FAT16 root_dir */ + dentries_per_clu = p_fs->dentries_in_root; + else + dentries_per_clu = p_fs->dentries_per_clu; + + clu.dir = p_dir->dir; + clu.size = p_dir->size; + clu.flags = p_dir->flags; + + while (clu.dir != CLUSTER_32(~0)) { + if (p_fs->dev_ejected) + break; + + for (i = 0; i < dentries_per_clu; i++) { + ep = get_entry_in_dir(sb, &clu, i, NULL); + if (!ep) + break; + + type = p_fs->fs_func->get_entry_type(ep); + + if (type == TYPE_UNUSED) + return TRUE; + if ((type != TYPE_FILE) && (type != TYPE_DIR)) + continue; + + if (p_dir->dir == CLUSTER_32(0)) { /* FAT16 root_dir */ + return FALSE; + } else { + if (p_fs->vol_type == EXFAT) + return FALSE; + if ((p_dir->dir == p_fs->root_dir) || ((++count) > 2)) + return FALSE; + } + } + + if (p_dir->dir == CLUSTER_32(0)) + break; /* FAT16 root_dir */ + + if (clu.flags == 0x03) { + if ((--clu.size) > 0) + clu.dir++; + else + clu.dir = CLUSTER_32(~0); + } else { + if (FAT_read(sb, clu.dir, &(clu.dir)) != 0) + break; + } + } + + return TRUE; +} /* end of is_dir_empty */ + +/* + * Name Conversion Functions + */ + +/* input : dir, uni_name + output : num_of_entry, dos_name(format : aaaaaa~1.bbb) */ +s32 get_num_entries_and_dos_name(struct super_block *sb, CHAIN_T *p_dir, UNI_NAME_T *p_uniname, s32 *entries, DOS_NAME_T *p_dosname) +{ + s32 ret, num_entries, lossy = FALSE; + char **r; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + num_entries = p_fs->fs_func->calc_num_entries(p_uniname); + if (num_entries == 0) + return FFS_INVALIDPATH; + + if (p_fs->vol_type != EXFAT) { + nls_uniname_to_dosname(sb, p_dosname, p_uniname, &lossy); + + if (lossy) { + ret = fat_generate_dos_name(sb, p_dir, p_dosname); + if (ret) + return ret; + } else { + for (r = reserved_names; *r; r++) { + if (!strncmp((void *) p_dosname->name, *r, 8)) + return FFS_INVALIDPATH; + } + + if (p_dosname->name_case != 0xFF) + num_entries = 1; + } + + if (num_entries > 1) + p_dosname->name_case = 0x0; + } + + *entries = num_entries; + + return FFS_SUCCESS; +} /* end of get_num_entries_and_dos_name */ + +void get_uni_name_from_dos_entry(struct super_block *sb, DOS_DENTRY_T *ep, UNI_NAME_T *p_uniname, u8 mode) +{ + DOS_NAME_T dos_name; + + if (mode == 0x0) + dos_name.name_case = 0x0; + else + dos_name.name_case = ep->lcase; + + memcpy(dos_name.name, ep->name, DOS_NAME_LENGTH); + nls_dosname_to_uniname(sb, p_uniname, &dos_name); +} /* end of get_uni_name_from_dos_entry */ + +void fat_get_uni_name_from_ext_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u16 *uniname) +{ + int i; + EXT_DENTRY_T *ep; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + for (entry--, i = 1; entry >= 0; entry--, i++) { + ep = (EXT_DENTRY_T *) get_entry_in_dir(sb, p_dir, entry, NULL); + if (!ep) + return; + + if (p_fs->fs_func->get_entry_type((DENTRY_T *) ep) == TYPE_EXTEND) { + extract_uni_name_from_ext_entry(ep, uniname, i); + if (ep->order > 0x40) + return; + } else { + return; + } + + uniname += 13; + } +} /* end of fat_get_uni_name_from_ext_entry */ + +void exfat_get_uni_name_from_ext_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u16 *uniname) +{ + int i; + DENTRY_T *ep; + ENTRY_SET_CACHE_T *es; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + es = get_entry_set_in_dir(sb, p_dir, entry, ES_ALL_ENTRIES, &ep); + if (es == NULL || es->num_entries < 3) { + if (es) + release_entry_set(es); + return; + } + + ep += 2; + + /* + * First entry : file entry + * Second entry : stream-extension entry + * Third entry : first file-name entry + * So, the index of first file-name dentry should start from 2. + */ + for (i = 2; i < es->num_entries; i++, ep++) { + if (p_fs->fs_func->get_entry_type(ep) == TYPE_EXTEND) + extract_uni_name_from_name_entry((NAME_DENTRY_T *)ep, uniname, i); + else + goto out; + uniname += 15; + } + +out: + release_entry_set(es); +} /* end of exfat_get_uni_name_from_ext_entry */ + +s32 extract_uni_name_from_ext_entry(EXT_DENTRY_T *ep, u16 *uniname, s32 order) +{ + int i, len = 0; + + for (i = 0; i < 10; i += 2) { + *uniname = GET16(ep->unicode_0_4+i); + if (*uniname == 0x0) + return len; + uniname++; + len++; + } + + if (order < 20) { + for (i = 0; i < 12; i += 2) { + *uniname = GET16_A(ep->unicode_5_10+i); + if (*uniname == 0x0) + return len; + uniname++; + len++; + } + } else { + for (i = 0; i < 8; i += 2) { + *uniname = GET16_A(ep->unicode_5_10+i); + if (*uniname == 0x0) + return len; + uniname++; + len++; + } + *uniname = 0x0; /* uniname[MAX_NAME_LENGTH-1] */ + return len; + } + + for (i = 0; i < 4; i += 2) { + *uniname = GET16_A(ep->unicode_11_12+i); + if (*uniname == 0x0) + return len; + uniname++; + len++; + } + + *uniname = 0x0; + return len; + +} /* end of extract_uni_name_from_ext_entry */ + +s32 extract_uni_name_from_name_entry(NAME_DENTRY_T *ep, u16 *uniname, s32 order) +{ + int i, len = 0; + + for (i = 0; i < 30; i += 2) { + *uniname = GET16_A(ep->unicode_0_14+i); + if (*uniname == 0x0) + return len; + uniname++; + len++; + } + + *uniname = 0x0; + return len; + +} /* end of extract_uni_name_from_name_entry */ + +s32 fat_generate_dos_name(struct super_block *sb, CHAIN_T *p_dir, DOS_NAME_T *p_dosname) +{ + int i, j, count = 0, count_begin = FALSE; + s32 dentries_per_clu; + u32 type; + u8 bmap[128/* 1 ~ 1023 */]; + CHAIN_T clu; + DOS_DENTRY_T *ep; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + memset(bmap, 0, sizeof bmap); + exfat_bitmap_set(bmap, 0); + + if (p_dir->dir == CLUSTER_32(0)) /* FAT16 root_dir */ + dentries_per_clu = p_fs->dentries_in_root; + else + dentries_per_clu = p_fs->dentries_per_clu; + + clu.dir = p_dir->dir; + clu.flags = p_dir->flags; + + while (clu.dir != CLUSTER_32(~0)) { + if (p_fs->dev_ejected) + break; + + for (i = 0; i < dentries_per_clu; i++) { + ep = (DOS_DENTRY_T *) get_entry_in_dir(sb, &clu, i, NULL); + if (!ep) + return FFS_MEDIAERR; + + type = p_fs->fs_func->get_entry_type((DENTRY_T *) ep); + + if (type == TYPE_UNUSED) + break; + if ((type != TYPE_FILE) && (type != TYPE_DIR)) + continue; + + count = 0; + count_begin = FALSE; + + for (j = 0; j < 8; j++) { + if (ep->name[j] == ' ') + break; + + if (ep->name[j] == '~') { + count_begin = TRUE; + } else if (count_begin) { + if ((ep->name[j] >= '0') && (ep->name[j] <= '9')) { + count = count * 10 + (ep->name[j] - '0'); + } else { + count = 0; + count_begin = FALSE; + } + } + } + + if ((count > 0) && (count < 1024)) + exfat_bitmap_set(bmap, count); + } + + if (p_dir->dir == CLUSTER_32(0)) + break; /* FAT16 root_dir */ + + if (FAT_read(sb, clu.dir, &(clu.dir)) != 0) + return FFS_MEDIAERR; + } + + count = 0; + for (i = 0; i < 128; i++) { + if (bmap[i] != 0xFF) { + for (j = 0; j < 8; j++) { + if (exfat_bitmap_test(&(bmap[i]), j) == 0) { + count = (i << 3) + j; + break; + } + } + if (count != 0) + break; + } + } + + if ((count == 0) || (count >= 1024)) + return FFS_FILEEXIST; + else + fat_attach_count_to_dos_name(p_dosname->name, count); + + /* Now dos_name has DOS~????.EXT */ + return FFS_SUCCESS; +} /* end of generate_dos_name */ + +void fat_attach_count_to_dos_name(u8 *dosname, s32 count) +{ + int i, j, length; + char str_count[6]; + + snprintf(str_count, sizeof str_count, "~%d", count); + length = strlen(str_count); + + i = j = 0; + while (j <= (8 - length)) { + i = j; + if (dosname[j] == ' ') + break; + if (dosname[j] & 0x80) + j += 2; + else + j++; + } + + for (j = 0; j < length; i++, j++) + dosname[i] = (u8) str_count[j]; + + if (i == 7) + dosname[7] = ' '; + +} /* end of attach_count_to_dos_name */ + +s32 fat_calc_num_entries(UNI_NAME_T *p_uniname) +{ + s32 len; + + len = p_uniname->name_len; + if (len == 0) + return 0; + + /* 1 dos name entry + extended entries */ + return (len-1) / 13 + 2; + +} /* end of calc_num_enties */ + +s32 exfat_calc_num_entries(UNI_NAME_T *p_uniname) +{ + s32 len; + + len = p_uniname->name_len; + if (len == 0) + return 0; + + /* 1 file entry + 1 stream entry + name entries */ + return (len-1) / 15 + 3; + +} /* end of exfat_calc_num_enties */ + +u8 calc_checksum_1byte(void *data, s32 len, u8 chksum) +{ + int i; + u8 *c = (u8 *) data; + + for (i = 0; i < len; i++, c++) + chksum = (((chksum & 1) << 7) | ((chksum & 0xFE) >> 1)) + *c; + + return chksum; +} /* end of calc_checksum_1byte */ + +u16 calc_checksum_2byte(void *data, s32 len, u16 chksum, s32 type) +{ + int i; + u8 *c = (u8 *) data; + + switch (type) { + case CS_DIR_ENTRY: + for (i = 0; i < len; i++, c++) { + if ((i == 2) || (i == 3)) + continue; + chksum = (((chksum & 1) << 15) | ((chksum & 0xFFFE) >> 1)) + (u16) *c; + } + break; + default + : + for (i = 0; i < len; i++, c++) + chksum = (((chksum & 1) << 15) | ((chksum & 0xFFFE) >> 1)) + (u16) *c; + } + + return chksum; +} /* end of calc_checksum_2byte */ + +u32 calc_checksum_4byte(void *data, s32 len, u32 chksum, s32 type) +{ + int i; + u8 *c = (u8 *) data; + + switch (type) { + case CS_PBR_SECTOR: + for (i = 0; i < len; i++, c++) { + if ((i == 106) || (i == 107) || (i == 112)) + continue; + chksum = (((chksum & 1) << 31) | ((chksum & 0xFFFFFFFE) >> 1)) + (u32) *c; + } + break; + default + : + for (i = 0; i < len; i++, c++) + chksum = (((chksum & 1) << 31) | ((chksum & 0xFFFFFFFE) >> 1)) + (u32) *c; + } + + return chksum; +} /* end of calc_checksum_4byte */ + +/* + * Name Resolution Functions + */ + +/* return values of resolve_path() + > 0 : return the length of the path + < 0 : return error */ +s32 resolve_path(struct inode *inode, char *path, CHAIN_T *p_dir, UNI_NAME_T *p_uniname) +{ + s32 lossy = FALSE; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + FILE_ID_T *fid = &(EXFAT_I(inode)->fid); + + if (strlen(path) >= (MAX_NAME_LENGTH * MAX_CHARSET_SIZE)) + return FFS_INVALIDPATH; + + strcpy(name_buf, path); + + nls_cstring_to_uniname(sb, p_uniname, name_buf, &lossy); + if (lossy) + return FFS_INVALIDPATH; + + fid->size = i_size_read(inode); + + p_dir->dir = fid->start_clu; + p_dir->size = (s32)(fid->size >> p_fs->cluster_size_bits); + p_dir->flags = fid->flags; + + return FFS_SUCCESS; +} + +/* + * File Operation Functions + */ +static FS_FUNC_T fat_fs_func = { + .alloc_cluster = fat_alloc_cluster, + .free_cluster = fat_free_cluster, + .count_used_clusters = fat_count_used_clusters, + + .init_dir_entry = fat_init_dir_entry, + .init_ext_entry = fat_init_ext_entry, + .find_dir_entry = fat_find_dir_entry, + .delete_dir_entry = fat_delete_dir_entry, + .get_uni_name_from_ext_entry = fat_get_uni_name_from_ext_entry, + .count_ext_entries = fat_count_ext_entries, + .calc_num_entries = fat_calc_num_entries, + + .get_entry_type = fat_get_entry_type, + .set_entry_type = fat_set_entry_type, + .get_entry_attr = fat_get_entry_attr, + .set_entry_attr = fat_set_entry_attr, + .get_entry_flag = fat_get_entry_flag, + .set_entry_flag = fat_set_entry_flag, + .get_entry_clu0 = fat_get_entry_clu0, + .set_entry_clu0 = fat_set_entry_clu0, + .get_entry_size = fat_get_entry_size, + .set_entry_size = fat_set_entry_size, + .get_entry_time = fat_get_entry_time, + .set_entry_time = fat_set_entry_time, +}; + + +s32 fat16_mount(struct super_block *sb, PBR_SECTOR_T *p_pbr) +{ + s32 num_reserved, num_root_sectors; + BPB16_T *p_bpb = (BPB16_T *) p_pbr->bpb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + if (p_bpb->num_fats == 0) + return FFS_FORMATERR; + + num_root_sectors = GET16(p_bpb->num_root_entries) << DENTRY_SIZE_BITS; + num_root_sectors = ((num_root_sectors-1) >> p_bd->sector_size_bits) + 1; + + p_fs->sectors_per_clu = p_bpb->sectors_per_clu; + p_fs->sectors_per_clu_bits = ilog2(p_bpb->sectors_per_clu); + p_fs->cluster_size_bits = p_fs->sectors_per_clu_bits + p_bd->sector_size_bits; + p_fs->cluster_size = 1 << p_fs->cluster_size_bits; + + p_fs->num_FAT_sectors = GET16(p_bpb->num_fat_sectors); + + p_fs->FAT1_start_sector = p_fs->PBR_sector + GET16(p_bpb->num_reserved); + if (p_bpb->num_fats == 1) + p_fs->FAT2_start_sector = p_fs->FAT1_start_sector; + else + p_fs->FAT2_start_sector = p_fs->FAT1_start_sector + p_fs->num_FAT_sectors; + + p_fs->root_start_sector = p_fs->FAT2_start_sector + p_fs->num_FAT_sectors; + p_fs->data_start_sector = p_fs->root_start_sector + num_root_sectors; + + p_fs->num_sectors = GET16(p_bpb->num_sectors); + if (p_fs->num_sectors == 0) + p_fs->num_sectors = GET32(p_bpb->num_huge_sectors); + + num_reserved = p_fs->data_start_sector - p_fs->PBR_sector; + p_fs->num_clusters = ((p_fs->num_sectors - num_reserved) >> p_fs->sectors_per_clu_bits) + 2; + /* because the cluster index starts with 2 */ + + if (p_fs->num_clusters < FAT12_THRESHOLD) + p_fs->vol_type = FAT12; + else + p_fs->vol_type = FAT16; + p_fs->vol_id = GET32(p_bpb->vol_serial); + + p_fs->root_dir = 0; + p_fs->dentries_in_root = GET16(p_bpb->num_root_entries); + p_fs->dentries_per_clu = 1 << (p_fs->cluster_size_bits - DENTRY_SIZE_BITS); + + p_fs->vol_flag = VOL_CLEAN; + p_fs->clu_srch_ptr = 2; + p_fs->used_clusters = (u32) ~0; + + p_fs->fs_func = &fat_fs_func; + + return FFS_SUCCESS; +} /* end of fat16_mount */ + +s32 fat32_mount(struct super_block *sb, PBR_SECTOR_T *p_pbr) +{ + s32 num_reserved; + BPB32_T *p_bpb = (BPB32_T *) p_pbr->bpb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + if (p_bpb->num_fats == 0) + return FFS_FORMATERR; + + p_fs->sectors_per_clu = p_bpb->sectors_per_clu; + p_fs->sectors_per_clu_bits = ilog2(p_bpb->sectors_per_clu); + p_fs->cluster_size_bits = p_fs->sectors_per_clu_bits + p_bd->sector_size_bits; + p_fs->cluster_size = 1 << p_fs->cluster_size_bits; + + p_fs->num_FAT_sectors = GET32(p_bpb->num_fat32_sectors); + + p_fs->FAT1_start_sector = p_fs->PBR_sector + GET16(p_bpb->num_reserved); + if (p_bpb->num_fats == 1) + p_fs->FAT2_start_sector = p_fs->FAT1_start_sector; + else + p_fs->FAT2_start_sector = p_fs->FAT1_start_sector + p_fs->num_FAT_sectors; + + p_fs->root_start_sector = p_fs->FAT2_start_sector + p_fs->num_FAT_sectors; + p_fs->data_start_sector = p_fs->root_start_sector; + + p_fs->num_sectors = GET32(p_bpb->num_huge_sectors); + num_reserved = p_fs->data_start_sector - p_fs->PBR_sector; + + p_fs->num_clusters = ((p_fs->num_sectors-num_reserved) >> p_fs->sectors_per_clu_bits) + 2; + /* because the cluster index starts with 2 */ + + p_fs->vol_type = FAT32; + p_fs->vol_id = GET32(p_bpb->vol_serial); + + p_fs->root_dir = GET32(p_bpb->root_cluster); + p_fs->dentries_in_root = 0; + p_fs->dentries_per_clu = 1 << (p_fs->cluster_size_bits - DENTRY_SIZE_BITS); + + p_fs->vol_flag = VOL_CLEAN; + p_fs->clu_srch_ptr = 2; + p_fs->used_clusters = (u32) ~0; + + p_fs->fs_func = &fat_fs_func; + + return FFS_SUCCESS; +} /* end of fat32_mount */ + +static FS_FUNC_T exfat_fs_func = { + .alloc_cluster = exfat_alloc_cluster, + .free_cluster = exfat_free_cluster, + .count_used_clusters = exfat_count_used_clusters, + + .init_dir_entry = exfat_init_dir_entry, + .init_ext_entry = exfat_init_ext_entry, + .find_dir_entry = exfat_find_dir_entry, + .delete_dir_entry = exfat_delete_dir_entry, + .get_uni_name_from_ext_entry = exfat_get_uni_name_from_ext_entry, + .count_ext_entries = exfat_count_ext_entries, + .calc_num_entries = exfat_calc_num_entries, + + .get_entry_type = exfat_get_entry_type, + .set_entry_type = exfat_set_entry_type, + .get_entry_attr = exfat_get_entry_attr, + .set_entry_attr = exfat_set_entry_attr, + .get_entry_flag = exfat_get_entry_flag, + .set_entry_flag = exfat_set_entry_flag, + .get_entry_clu0 = exfat_get_entry_clu0, + .set_entry_clu0 = exfat_set_entry_clu0, + .get_entry_size = exfat_get_entry_size, + .set_entry_size = exfat_set_entry_size, + .get_entry_time = exfat_get_entry_time, + .set_entry_time = exfat_set_entry_time, +}; + +s32 exfat_mount(struct super_block *sb, PBR_SECTOR_T *p_pbr) +{ + BPBEX_T *p_bpb = (BPBEX_T *) p_pbr->bpb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + + if (p_bpb->num_fats == 0) + return FFS_FORMATERR; + + p_fs->sectors_per_clu = 1 << p_bpb->sectors_per_clu_bits; + p_fs->sectors_per_clu_bits = p_bpb->sectors_per_clu_bits; + p_fs->cluster_size_bits = p_fs->sectors_per_clu_bits + p_bd->sector_size_bits; + p_fs->cluster_size = 1 << p_fs->cluster_size_bits; + + p_fs->num_FAT_sectors = GET32(p_bpb->fat_length); + + p_fs->FAT1_start_sector = p_fs->PBR_sector + GET32(p_bpb->fat_offset); + if (p_bpb->num_fats == 1) + p_fs->FAT2_start_sector = p_fs->FAT1_start_sector; + else + p_fs->FAT2_start_sector = p_fs->FAT1_start_sector + p_fs->num_FAT_sectors; + + p_fs->root_start_sector = p_fs->PBR_sector + GET32(p_bpb->clu_offset); + p_fs->data_start_sector = p_fs->root_start_sector; + + p_fs->num_sectors = GET64(p_bpb->vol_length); + p_fs->num_clusters = GET32(p_bpb->clu_count) + 2; + /* because the cluster index starts with 2 */ + + p_fs->vol_type = EXFAT; + p_fs->vol_id = GET32(p_bpb->vol_serial); + + p_fs->root_dir = GET32(p_bpb->root_cluster); + p_fs->dentries_in_root = 0; + p_fs->dentries_per_clu = 1 << (p_fs->cluster_size_bits - DENTRY_SIZE_BITS); + + p_fs->vol_flag = (u32) GET16(p_bpb->vol_flags); + p_fs->clu_srch_ptr = 2; + p_fs->used_clusters = (u32) ~0; + + p_fs->fs_func = &exfat_fs_func; + + return FFS_SUCCESS; +} /* end of exfat_mount */ + +s32 create_dir(struct inode *inode, CHAIN_T *p_dir, UNI_NAME_T *p_uniname, FILE_ID_T *fid) +{ + s32 ret, dentry, num_entries; + u64 size; + CHAIN_T clu; + DOS_NAME_T dos_name, dot_name; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + ret = get_num_entries_and_dos_name(sb, p_dir, p_uniname, &num_entries, &dos_name); + if (ret) + return ret; + + /* find_empty_entry must be called before alloc_cluster */ + dentry = find_empty_entry(inode, p_dir, num_entries); + if (dentry < 0) + return FFS_FULL; + + clu.dir = CLUSTER_32(~0); + clu.size = 0; + clu.flags = (p_fs->vol_type == EXFAT) ? 0x03 : 0x01; + + /* (1) allocate a cluster */ + ret = p_fs->fs_func->alloc_cluster(sb, 1, &clu); + if (ret < 0) + return FFS_MEDIAERR; + else if (ret == 0) + return FFS_FULL; + + ret = clear_cluster(sb, clu.dir); + if (ret != FFS_SUCCESS) + return ret; + + if (p_fs->vol_type == EXFAT) { + size = p_fs->cluster_size; + } else { + size = 0; + + /* initialize the . and .. entry + Information for . points to itself + Information for .. points to parent dir */ + + dot_name.name_case = 0x0; + memcpy(dot_name.name, DOS_CUR_DIR_NAME, DOS_NAME_LENGTH); + + ret = p_fs->fs_func->init_dir_entry(sb, &clu, 0, TYPE_DIR, clu.dir, 0); + if (ret != FFS_SUCCESS) + return ret; + + ret = p_fs->fs_func->init_ext_entry(sb, &clu, 0, 1, NULL, &dot_name); + if (ret != FFS_SUCCESS) + return ret; + + memcpy(dot_name.name, DOS_PAR_DIR_NAME, DOS_NAME_LENGTH); + + if (p_dir->dir == p_fs->root_dir) + ret = p_fs->fs_func->init_dir_entry(sb, &clu, 1, TYPE_DIR, CLUSTER_32(0), 0); + else + ret = p_fs->fs_func->init_dir_entry(sb, &clu, 1, TYPE_DIR, p_dir->dir, 0); + + if (ret != FFS_SUCCESS) + return ret; + + ret = p_fs->fs_func->init_ext_entry(sb, &clu, 1, 1, NULL, &dot_name); + if (ret != FFS_SUCCESS) + return ret; + } + + /* (2) update the directory entry */ + /* make sub-dir entry in parent directory */ + ret = p_fs->fs_func->init_dir_entry(sb, p_dir, dentry, TYPE_DIR, clu.dir, size); + if (ret != FFS_SUCCESS) + return ret; + + ret = p_fs->fs_func->init_ext_entry(sb, p_dir, dentry, num_entries, p_uniname, &dos_name); + if (ret != FFS_SUCCESS) + return ret; + + fid->dir.dir = p_dir->dir; + fid->dir.size = p_dir->size; + fid->dir.flags = p_dir->flags; + fid->entry = dentry; + + fid->attr = ATTR_SUBDIR; + fid->flags = (p_fs->vol_type == EXFAT) ? 0x03 : 0x01; + fid->size = size; + fid->start_clu = clu.dir; + + fid->type = TYPE_DIR; + fid->rwoffset = 0; + fid->hint_last_off = -1; + + return FFS_SUCCESS; +} /* end of create_dir */ + +s32 create_file(struct inode *inode, CHAIN_T *p_dir, UNI_NAME_T *p_uniname, u8 mode, FILE_ID_T *fid) +{ + s32 ret, dentry, num_entries; + DOS_NAME_T dos_name; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + ret = get_num_entries_and_dos_name(sb, p_dir, p_uniname, &num_entries, &dos_name); + if (ret) + return ret; + + /* find_empty_entry must be called before alloc_cluster() */ + dentry = find_empty_entry(inode, p_dir, num_entries); + if (dentry < 0) + return FFS_FULL; + + /* (1) update the directory entry */ + /* fill the dos name directory entry information of the created file. + the first cluster is not determined yet. (0) */ + ret = p_fs->fs_func->init_dir_entry(sb, p_dir, dentry, TYPE_FILE | mode, CLUSTER_32(0), 0); + if (ret != FFS_SUCCESS) + return ret; + + ret = p_fs->fs_func->init_ext_entry(sb, p_dir, dentry, num_entries, p_uniname, &dos_name); + if (ret != FFS_SUCCESS) + return ret; + + fid->dir.dir = p_dir->dir; + fid->dir.size = p_dir->size; + fid->dir.flags = p_dir->flags; + fid->entry = dentry; + + fid->attr = ATTR_ARCHIVE | mode; + fid->flags = (p_fs->vol_type == EXFAT) ? 0x03 : 0x01; + fid->size = 0; + fid->start_clu = CLUSTER_32(~0); + + fid->type = TYPE_FILE; + fid->rwoffset = 0; + fid->hint_last_off = -1; + + return FFS_SUCCESS; +} /* end of create_file */ + +void remove_file(struct inode *inode, CHAIN_T *p_dir, s32 entry) +{ + s32 num_entries; + u32 sector; + DENTRY_T *ep; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + ep = get_entry_in_dir(sb, p_dir, entry, §or); + if (!ep) + return; + + buf_lock(sb, sector); + + /* buf_lock() before call count_ext_entries() */ + num_entries = p_fs->fs_func->count_ext_entries(sb, p_dir, entry, ep); + if (num_entries < 0) { + buf_unlock(sb, sector); + return; + } + num_entries++; + + buf_unlock(sb, sector); + + /* (1) update the directory entry */ + p_fs->fs_func->delete_dir_entry(sb, p_dir, entry, 0, num_entries); +} /* end of remove_file */ + +s32 rename_file(struct inode *inode, CHAIN_T *p_dir, s32 oldentry, UNI_NAME_T *p_uniname, FILE_ID_T *fid) +{ + s32 ret, newentry = -1, num_old_entries, num_new_entries; + u32 sector_old, sector_new; + DOS_NAME_T dos_name; + DENTRY_T *epold, *epnew; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + epold = get_entry_in_dir(sb, p_dir, oldentry, §or_old); + if (!epold) + return FFS_MEDIAERR; + + buf_lock(sb, sector_old); + + /* buf_lock() before call count_ext_entries() */ + num_old_entries = p_fs->fs_func->count_ext_entries(sb, p_dir, oldentry, epold); + if (num_old_entries < 0) { + buf_unlock(sb, sector_old); + return FFS_MEDIAERR; + } + num_old_entries++; + + ret = get_num_entries_and_dos_name(sb, p_dir, p_uniname, &num_new_entries, &dos_name); + if (ret) { + buf_unlock(sb, sector_old); + return ret; + } + + if (num_old_entries < num_new_entries) { + newentry = find_empty_entry(inode, p_dir, num_new_entries); + if (newentry < 0) { + buf_unlock(sb, sector_old); + return FFS_FULL; + } + + epnew = get_entry_in_dir(sb, p_dir, newentry, §or_new); + if (!epnew) { + buf_unlock(sb, sector_old); + return FFS_MEDIAERR; + } + + memcpy((void *) epnew, (void *) epold, DENTRY_SIZE); + if (p_fs->fs_func->get_entry_type(epnew) == TYPE_FILE) { + p_fs->fs_func->set_entry_attr(epnew, p_fs->fs_func->get_entry_attr(epnew) | ATTR_ARCHIVE); + fid->attr |= ATTR_ARCHIVE; + } + buf_modify(sb, sector_new); + buf_unlock(sb, sector_old); + + if (p_fs->vol_type == EXFAT) { + epold = get_entry_in_dir(sb, p_dir, oldentry+1, §or_old); + buf_lock(sb, sector_old); + epnew = get_entry_in_dir(sb, p_dir, newentry+1, §or_new); + + if (!epold || !epnew) { + buf_unlock(sb, sector_old); + return FFS_MEDIAERR; + } + + memcpy((void *) epnew, (void *) epold, DENTRY_SIZE); + buf_modify(sb, sector_new); + buf_unlock(sb, sector_old); + } + + ret = p_fs->fs_func->init_ext_entry(sb, p_dir, newentry, num_new_entries, p_uniname, &dos_name); + if (ret != FFS_SUCCESS) + return ret; + + p_fs->fs_func->delete_dir_entry(sb, p_dir, oldentry, 0, num_old_entries); + fid->entry = newentry; + } else { + if (p_fs->fs_func->get_entry_type(epold) == TYPE_FILE) { + p_fs->fs_func->set_entry_attr(epold, p_fs->fs_func->get_entry_attr(epold) | ATTR_ARCHIVE); + fid->attr |= ATTR_ARCHIVE; + } + buf_modify(sb, sector_old); + buf_unlock(sb, sector_old); + + ret = p_fs->fs_func->init_ext_entry(sb, p_dir, oldentry, num_new_entries, p_uniname, &dos_name); + if (ret != FFS_SUCCESS) + return ret; + + p_fs->fs_func->delete_dir_entry(sb, p_dir, oldentry, num_new_entries, num_old_entries); + } + + return FFS_SUCCESS; +} /* end of rename_file */ + +s32 move_file(struct inode *inode, CHAIN_T *p_olddir, s32 oldentry, CHAIN_T *p_newdir, UNI_NAME_T *p_uniname, FILE_ID_T *fid) +{ + s32 ret, newentry, num_new_entries, num_old_entries; + u32 sector_mov, sector_new; + CHAIN_T clu; + DOS_NAME_T dos_name; + DENTRY_T *epmov, *epnew; + struct super_block *sb = inode->i_sb; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + epmov = get_entry_in_dir(sb, p_olddir, oldentry, §or_mov); + if (!epmov) + return FFS_MEDIAERR; + + /* check if the source and target directory is the same */ + if (p_fs->fs_func->get_entry_type(epmov) == TYPE_DIR && + p_fs->fs_func->get_entry_clu0(epmov) == p_newdir->dir) + return FFS_INVALIDPATH; + + buf_lock(sb, sector_mov); + + /* buf_lock() before call count_ext_entries() */ + num_old_entries = p_fs->fs_func->count_ext_entries(sb, p_olddir, oldentry, epmov); + if (num_old_entries < 0) { + buf_unlock(sb, sector_mov); + return FFS_MEDIAERR; + } + num_old_entries++; + + ret = get_num_entries_and_dos_name(sb, p_newdir, p_uniname, &num_new_entries, &dos_name); + if (ret) { + buf_unlock(sb, sector_mov); + return ret; + } + + newentry = find_empty_entry(inode, p_newdir, num_new_entries); + if (newentry < 0) { + buf_unlock(sb, sector_mov); + return FFS_FULL; + } + + epnew = get_entry_in_dir(sb, p_newdir, newentry, §or_new); + if (!epnew) { + buf_unlock(sb, sector_mov); + return FFS_MEDIAERR; + } + + memcpy((void *) epnew, (void *) epmov, DENTRY_SIZE); + if (p_fs->fs_func->get_entry_type(epnew) == TYPE_FILE) { + p_fs->fs_func->set_entry_attr(epnew, p_fs->fs_func->get_entry_attr(epnew) | ATTR_ARCHIVE); + fid->attr |= ATTR_ARCHIVE; + } + buf_modify(sb, sector_new); + buf_unlock(sb, sector_mov); + + if (p_fs->vol_type == EXFAT) { + epmov = get_entry_in_dir(sb, p_olddir, oldentry+1, §or_mov); + buf_lock(sb, sector_mov); + epnew = get_entry_in_dir(sb, p_newdir, newentry+1, §or_new); + if (!epmov || !epnew) { + buf_unlock(sb, sector_mov); + return FFS_MEDIAERR; + } + + memcpy((void *) epnew, (void *) epmov, DENTRY_SIZE); + buf_modify(sb, sector_new); + buf_unlock(sb, sector_mov); + } else if (p_fs->fs_func->get_entry_type(epnew) == TYPE_DIR) { + /* change ".." pointer to new parent dir */ + clu.dir = p_fs->fs_func->get_entry_clu0(epnew); + clu.flags = 0x01; + + epnew = get_entry_in_dir(sb, &clu, 1, §or_new); + if (!epnew) + return FFS_MEDIAERR; + + if (p_newdir->dir == p_fs->root_dir) + p_fs->fs_func->set_entry_clu0(epnew, CLUSTER_32(0)); + else + p_fs->fs_func->set_entry_clu0(epnew, p_newdir->dir); + buf_modify(sb, sector_new); + } + + ret = p_fs->fs_func->init_ext_entry(sb, p_newdir, newentry, num_new_entries, p_uniname, &dos_name); + if (ret != FFS_SUCCESS) + return ret; + + p_fs->fs_func->delete_dir_entry(sb, p_olddir, oldentry, 0, num_old_entries); + + fid->dir.dir = p_newdir->dir; + fid->dir.size = p_newdir->size; + fid->dir.flags = p_newdir->flags; + + fid->entry = newentry; + + return FFS_SUCCESS; +} /* end of move_file */ + +/* + * Sector Read/Write Functions + */ + +s32 sector_read(struct super_block *sb, u32 sec, struct buffer_head **bh, s32 read) +{ + s32 ret = FFS_MEDIAERR; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + if ((sec >= (p_fs->PBR_sector+p_fs->num_sectors)) && (p_fs->num_sectors > 0)) { + printk("[EXFAT] sector_read: out of range error! (sec = %d)\n", sec); + fs_error(sb); + return ret; + } + + if (!p_fs->dev_ejected) { + ret = bdev_read(sb, sec, bh, 1, read); + if (ret != FFS_SUCCESS) + p_fs->dev_ejected = TRUE; + } + + return ret; +} /* end of sector_read */ + +s32 sector_write(struct super_block *sb, u32 sec, struct buffer_head *bh, s32 sync) +{ + s32 ret = FFS_MEDIAERR; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + if (sec >= (p_fs->PBR_sector+p_fs->num_sectors) && (p_fs->num_sectors > 0)) { + printk("[EXFAT] sector_write: out of range error! (sec = %d)\n", sec); + fs_error(sb); + return ret; + } + + if (bh == NULL) { + printk("[EXFAT] sector_write: bh is NULL!\n"); + fs_error(sb); + return ret; + } + + if (!p_fs->dev_ejected) { + ret = bdev_write(sb, sec, bh, 1, sync); + if (ret != FFS_SUCCESS) + p_fs->dev_ejected = TRUE; + } + + return ret; +} /* end of sector_write */ + +s32 multi_sector_read(struct super_block *sb, u32 sec, struct buffer_head **bh, s32 num_secs, s32 read) +{ + s32 ret = FFS_MEDIAERR; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + if (((sec+num_secs) > (p_fs->PBR_sector+p_fs->num_sectors)) && (p_fs->num_sectors > 0)) { + printk("[EXFAT] multi_sector_read: out of range error! (sec = %d, num_secs = %d)\n", sec, num_secs); + fs_error(sb); + return ret; + } + + if (!p_fs->dev_ejected) { + ret = bdev_read(sb, sec, bh, num_secs, read); + if (ret != FFS_SUCCESS) + p_fs->dev_ejected = TRUE; + } + + return ret; +} /* end of multi_sector_read */ + +s32 multi_sector_write(struct super_block *sb, u32 sec, struct buffer_head *bh, s32 num_secs, s32 sync) +{ + s32 ret = FFS_MEDIAERR; + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + if ((sec+num_secs) > (p_fs->PBR_sector+p_fs->num_sectors) && (p_fs->num_sectors > 0)) { + printk("[EXFAT] multi_sector_write: out of range error! (sec = %d, num_secs = %d)\n", sec, num_secs); + fs_error(sb); + return ret; + } + if (bh == NULL) { + printk("[EXFAT] multi_sector_write: bh is NULL!\n"); + fs_error(sb); + return ret; + } + + if (!p_fs->dev_ejected) { + ret = bdev_write(sb, sec, bh, num_secs, sync); + if (ret != FFS_SUCCESS) + p_fs->dev_ejected = TRUE; + } + + return ret; +} /* end of multi_sector_write */ diff --git b/fs/exfat/exfat_core.h b/fs/exfat/exfat_core.h new file mode 100644 index 0000000..4bdfe4e --- /dev/null +++ b/fs/exfat/exfat_core.h @@ -0,0 +1,671 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_core.h */ +/* PURPOSE : Header File for exFAT File Manager */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#ifndef _EXFAT_H +#define _EXFAT_H + +#include "exfat_config.h" +#include "exfat_data.h" +#include "exfat_oal.h" + +#include "exfat_blkdev.h" +#include "exfat_cache.h" +#include "exfat_nls.h" +#include "exfat_api.h" +#include "exfat_cache.h" + +#ifdef CONFIG_EXFAT_KERNEL_DEBUG + /* For Debugging Purpose */ + /* IOCTL code 'f' used by + * - file systems typically #0~0x1F + * - embedded terminal devices #128~ + * - exts for debugging purpose #99 + * number 100 and 101 is availble now but has possible conflicts + */ +#define EXFAT_IOC_GET_DEBUGFLAGS _IOR('f', 100, long) +#define EXFAT_IOC_SET_DEBUGFLAGS _IOW('f', 101, long) + +#define EXFAT_DEBUGFLAGS_INVALID_UMOUNT 0x01 +#define EXFAT_DEBUGFLAGS_ERROR_RW 0x02 +#endif /* CONFIG_EXFAT_KERNEL_DEBUG */ + + /*----------------------------------------------------------------------*/ + /* Constant & Macro Definitions */ + /*----------------------------------------------------------------------*/ + +#define DENTRY_SIZE 32 /* dir entry size */ +#define DENTRY_SIZE_BITS 5 + +/* PBR entries */ +#define PBR_SIGNATURE 0xAA55 +#define EXT_SIGNATURE 0xAA550000 +#define VOL_LABEL "NO NAME " /* size should be 11 */ +#define OEM_NAME "MSWIN4.1" /* size should be 8 */ +#define STR_FAT12 "FAT12 " /* size should be 8 */ +#define STR_FAT16 "FAT16 " /* size should be 8 */ +#define STR_FAT32 "FAT32 " /* size should be 8 */ +#define STR_EXFAT "EXFAT " /* size should be 8 */ +#define VOL_CLEAN 0x0000 +#define VOL_DIRTY 0x0002 + +/* max number of clusters */ +#define FAT12_THRESHOLD 4087 /* 2^12 - 1 + 2 (clu 0 & 1) */ +#define FAT16_THRESHOLD 65527 /* 2^16 - 1 + 2 */ +#define FAT32_THRESHOLD 268435457 /* 2^28 - 1 + 2 */ +#define EXFAT_THRESHOLD 268435457 /* 2^28 - 1 + 2 */ + +/* file types */ +#define TYPE_UNUSED 0x0000 +#define TYPE_DELETED 0x0001 +#define TYPE_INVALID 0x0002 +#define TYPE_CRITICAL_PRI 0x0100 +#define TYPE_BITMAP 0x0101 +#define TYPE_UPCASE 0x0102 +#define TYPE_VOLUME 0x0103 +#define TYPE_DIR 0x0104 +#define TYPE_FILE 0x011F +#define TYPE_SYMLINK 0x015F +#define TYPE_CRITICAL_SEC 0x0200 +#define TYPE_STREAM 0x0201 +#define TYPE_EXTEND 0x0202 +#define TYPE_ACL 0x0203 +#define TYPE_BENIGN_PRI 0x0400 +#define TYPE_GUID 0x0401 +#define TYPE_PADDING 0x0402 +#define TYPE_ACLTAB 0x0403 +#define TYPE_BENIGN_SEC 0x0800 +#define TYPE_ALL 0x0FFF + +/* time modes */ +#define TM_CREATE 0 +#define TM_MODIFY 1 +#define TM_ACCESS 2 + +/* checksum types */ +#define CS_DIR_ENTRY 0 +#define CS_PBR_SECTOR 1 +#define CS_DEFAULT 2 + +#define CLUSTER_16(x) ((u16)(x)) +#define CLUSTER_32(x) ((u32)(x)) + +#define FALSE 0 +#define TRUE 1 + +#define MIN(a, b) (((a) < (b)) ? (a) : (b)) +#define MAX(a, b) (((a) > (b)) ? (a) : (b)) + +#define START_SECTOR(x) \ + ((((x) - 2) << p_fs->sectors_per_clu_bits) + p_fs->data_start_sector) + +#define IS_LAST_SECTOR_IN_CLUSTER(sec) \ + ((((sec) - p_fs->data_start_sector + 1) & ((1 << p_fs->sectors_per_clu_bits) - 1)) == 0) + +#define GET_CLUSTER_FROM_SECTOR(sec) \ + ((((sec) - p_fs->data_start_sector) >> p_fs->sectors_per_clu_bits) + 2) + +#define GET16(p_src) \ + (((u16)(p_src)[0]) | (((u16)(p_src)[1]) << 8)) +#define GET32(p_src) \ + (((u32)(p_src)[0]) | (((u32)(p_src)[1]) << 8) | \ + (((u32)(p_src)[2]) << 16) | (((u32)(p_src)[3]) << 24)) +#define GET64(p_src) \ + (((u64)(p_src)[0]) | (((u64)(p_src)[1]) << 8) | \ + (((u64)(p_src)[2]) << 16) | (((u64)(p_src)[3]) << 24) | \ + (((u64)(p_src)[4]) << 32) | (((u64)(p_src)[5]) << 40) | \ + (((u64)(p_src)[6]) << 48) | (((u64)(p_src)[7]) << 56)) + + +#define SET16(p_dst, src) \ + do { \ + (p_dst)[0] = (u8)(src); \ + (p_dst)[1] = (u8)(((u16)(src)) >> 8); \ + } while (0) +#define SET32(p_dst, src) \ + do { \ + (p_dst)[0] = (u8)(src); \ + (p_dst)[1] = (u8)(((u32)(src)) >> 8); \ + (p_dst)[2] = (u8)(((u32)(src)) >> 16); \ + (p_dst)[3] = (u8)(((u32)(src)) >> 24); \ + } while (0) +#define SET64(p_dst, src) \ + do { \ + (p_dst)[0] = (u8)(src); \ + (p_dst)[1] = (u8)(((u64)(src)) >> 8); \ + (p_dst)[2] = (u8)(((u64)(src)) >> 16); \ + (p_dst)[3] = (u8)(((u64)(src)) >> 24); \ + (p_dst)[4] = (u8)(((u64)(src)) >> 32); \ + (p_dst)[5] = (u8)(((u64)(src)) >> 40); \ + (p_dst)[6] = (u8)(((u64)(src)) >> 48); \ + (p_dst)[7] = (u8)(((u64)(src)) >> 56); \ + } while (0) + +#ifdef __LITTLE_ENDIAN +#define GET16_A(p_src) (*((u16 *)(p_src))) +#define GET32_A(p_src) (*((u32 *)(p_src))) +#define GET64_A(p_src) (*((u64 *)(p_src))) +#define SET16_A(p_dst, src) (*((u16 *)(p_dst)) = (u16)(src)) +#define SET32_A(p_dst, src) (*((u32 *)(p_dst)) = (u32)(src)) +#define SET64_A(p_dst, src) (*((u64 *)(p_dst)) = (u64)(src)) +#else /* BIG_ENDIAN */ +#define GET16_A(p_src) GET16(p_src) +#define GET32_A(p_src) GET32(p_src) +#define GET64_A(p_src) GET64(p_src) +#define SET16_A(p_dst, src) SET16(p_dst, src) +#define SET32_A(p_dst, src) SET32(p_dst, src) +#define SET64_A(p_dst, src) SET64(p_dst, src) +#endif + +/* Upcase tabel mecro */ +#define HIGH_INDEX_BIT (8) +#define HIGH_INDEX_MASK (0xFF00) +#define LOW_INDEX_BIT (16-HIGH_INDEX_BIT) +#define UTBL_ROW_COUNT (1<> LOW_INDEX_BIT; +} +static inline u16 get_row_index(u16 i) +{ + return i & ~HIGH_INDEX_MASK; +} +/*----------------------------------------------------------------------*/ +/* Type Definitions */ +/*----------------------------------------------------------------------*/ + +/* MS_DOS FAT partition boot record (512 bytes) */ +typedef struct { + u8 jmp_boot[3]; + u8 oem_name[8]; + u8 bpb[109]; + u8 boot_code[390]; + u8 signature[2]; +} PBR_SECTOR_T; + +/* MS-DOS FAT12/16 BIOS parameter block (51 bytes) */ +typedef struct { + u8 sector_size[2]; + u8 sectors_per_clu; + u8 num_reserved[2]; + u8 num_fats; + u8 num_root_entries[2]; + u8 num_sectors[2]; + u8 media_type; + u8 num_fat_sectors[2]; + u8 sectors_in_track[2]; + u8 num_heads[2]; + u8 num_hid_sectors[4]; + u8 num_huge_sectors[4]; + + u8 phy_drv_no; + u8 reserved; + u8 ext_signature; + u8 vol_serial[4]; + u8 vol_label[11]; + u8 vol_type[8]; +} BPB16_T; + +/* MS-DOS FAT32 BIOS parameter block (79 bytes) */ +typedef struct { + u8 sector_size[2]; + u8 sectors_per_clu; + u8 num_reserved[2]; + u8 num_fats; + u8 num_root_entries[2]; + u8 num_sectors[2]; + u8 media_type; + u8 num_fat_sectors[2]; + u8 sectors_in_track[2]; + u8 num_heads[2]; + u8 num_hid_sectors[4]; + u8 num_huge_sectors[4]; + u8 num_fat32_sectors[4]; + u8 ext_flags[2]; + u8 fs_version[2]; + u8 root_cluster[4]; + u8 fsinfo_sector[2]; + u8 backup_sector[2]; + u8 reserved[12]; + + u8 phy_drv_no; + u8 ext_reserved; + u8 ext_signature; + u8 vol_serial[4]; + u8 vol_label[11]; + u8 vol_type[8]; +} BPB32_T; + +/* MS-DOS EXFAT BIOS parameter block (109 bytes) */ +typedef struct { + u8 reserved1[53]; + u8 vol_offset[8]; + u8 vol_length[8]; + u8 fat_offset[4]; + u8 fat_length[4]; + u8 clu_offset[4]; + u8 clu_count[4]; + u8 root_cluster[4]; + u8 vol_serial[4]; + u8 fs_version[2]; + u8 vol_flags[2]; + u8 sector_size_bits; + u8 sectors_per_clu_bits; + u8 num_fats; + u8 phy_drv_no; + u8 perc_in_use; + u8 reserved2[7]; +} BPBEX_T; + +/* MS-DOS FAT file system information sector (512 bytes) */ +typedef struct { + u8 signature1[4]; + u8 reserved1[480]; + u8 signature2[4]; + u8 free_cluster[4]; + u8 next_cluster[4]; + u8 reserved2[14]; + u8 signature3[2]; +} FSI_SECTOR_T; + +/* MS-DOS FAT directory entry (32 bytes) */ +typedef struct { + u8 dummy[32]; +} DENTRY_T; + +typedef struct { + u8 name[DOS_NAME_LENGTH]; + u8 attr; + u8 lcase; + u8 create_time_ms; + u8 create_time[2]; + u8 create_date[2]; + u8 access_date[2]; + u8 start_clu_hi[2]; + u8 modify_time[2]; + u8 modify_date[2]; + u8 start_clu_lo[2]; + u8 size[4]; +} DOS_DENTRY_T; + +/* MS-DOS FAT extended directory entry (32 bytes) */ +typedef struct { + u8 order; + u8 unicode_0_4[10]; + u8 attr; + u8 sysid; + u8 checksum; + u8 unicode_5_10[12]; + u8 start_clu[2]; + u8 unicode_11_12[4]; +} EXT_DENTRY_T; + +/* MS-DOS EXFAT file directory entry (32 bytes) */ +typedef struct { + u8 type; + u8 num_ext; + u8 checksum[2]; + u8 attr[2]; + u8 reserved1[2]; + u8 create_time[2]; + u8 create_date[2]; + u8 modify_time[2]; + u8 modify_date[2]; + u8 access_time[2]; + u8 access_date[2]; + u8 create_time_ms; + u8 modify_time_ms; + u8 access_time_ms; + u8 reserved2[9]; +} FILE_DENTRY_T; + +/* MS-DOS EXFAT stream extension directory entry (32 bytes) */ +typedef struct { + u8 type; + u8 flags; + u8 reserved1; + u8 name_len; + u8 name_hash[2]; + u8 reserved2[2]; + u8 valid_size[8]; + u8 reserved3[4]; + u8 start_clu[4]; + u8 size[8]; +} STRM_DENTRY_T; + +/* MS-DOS EXFAT file name directory entry (32 bytes) */ +typedef struct { + u8 type; + u8 flags; + u8 unicode_0_14[30]; +} NAME_DENTRY_T; + +/* MS-DOS EXFAT allocation bitmap directory entry (32 bytes) */ +typedef struct { + u8 type; + u8 flags; + u8 reserved[18]; + u8 start_clu[4]; + u8 size[8]; +} BMAP_DENTRY_T; + +/* MS-DOS EXFAT up-case table directory entry (32 bytes) */ +typedef struct { + u8 type; + u8 reserved1[3]; + u8 checksum[4]; + u8 reserved2[12]; + u8 start_clu[4]; + u8 size[8]; +} CASE_DENTRY_T; + +/* MS-DOS EXFAT volume label directory entry (32 bytes) */ +typedef struct { + u8 type; + u8 label_len; + u8 unicode_0_10[22]; + u8 reserved[8]; +} VOLM_DENTRY_T; + +/* unused entry hint information */ +typedef struct { + u32 dir; + s32 entry; + CHAIN_T clu; +} UENTRY_T; + +typedef struct { + s32 (*alloc_cluster)(struct super_block *sb, s32 num_alloc, CHAIN_T *p_chain); + void (*free_cluster)(struct super_block *sb, CHAIN_T *p_chain, s32 do_relse); + s32 (*count_used_clusters)(struct super_block *sb); + + s32 (*init_dir_entry)(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u32 type, + u32 start_clu, u64 size); + s32 (*init_ext_entry)(struct super_block *sb, CHAIN_T *p_dir, s32 entry, s32 num_entries, + UNI_NAME_T *p_uniname, DOS_NAME_T *p_dosname); + s32 (*find_dir_entry)(struct super_block *sb, CHAIN_T *p_dir, UNI_NAME_T *p_uniname, s32 num_entries, DOS_NAME_T *p_dosname, u32 type); + void (*delete_dir_entry)(struct super_block *sb, CHAIN_T *p_dir, s32 entry, s32 offset, s32 num_entries); + void (*get_uni_name_from_ext_entry)(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u16 *uniname); + s32 (*count_ext_entries)(struct super_block *sb, CHAIN_T *p_dir, s32 entry, DENTRY_T *p_entry); + s32 (*calc_num_entries)(UNI_NAME_T *p_uniname); + + u32 (*get_entry_type)(DENTRY_T *p_entry); + void (*set_entry_type)(DENTRY_T *p_entry, u32 type); + u32 (*get_entry_attr)(DENTRY_T *p_entry); + void (*set_entry_attr)(DENTRY_T *p_entry, u32 attr); + u8 (*get_entry_flag)(DENTRY_T *p_entry); + void (*set_entry_flag)(DENTRY_T *p_entry, u8 flag); + u32 (*get_entry_clu0)(DENTRY_T *p_entry); + void (*set_entry_clu0)(DENTRY_T *p_entry, u32 clu0); + u64 (*get_entry_size)(DENTRY_T *p_entry); + void (*set_entry_size)(DENTRY_T *p_entry, u64 size); + void (*get_entry_time)(DENTRY_T *p_entry, TIMESTAMP_T *tp, u8 mode); + void (*set_entry_time)(DENTRY_T *p_entry, TIMESTAMP_T *tp, u8 mode); +} FS_FUNC_T; + +typedef struct __FS_INFO_T { + u32 drv; /* drive ID */ + u32 vol_type; /* volume FAT type */ + u32 vol_id; /* volume serial number */ + + u32 num_sectors; /* num of sectors in volume */ + u32 num_clusters; /* num of clusters in volume */ + u32 cluster_size; /* cluster size in bytes */ + u32 cluster_size_bits; + u32 sectors_per_clu; /* cluster size in sectors */ + u32 sectors_per_clu_bits; + + u32 PBR_sector; /* PBR sector */ + u32 FAT1_start_sector; /* FAT1 start sector */ + u32 FAT2_start_sector; /* FAT2 start sector */ + u32 root_start_sector; /* root dir start sector */ + u32 data_start_sector; /* data area start sector */ + u32 num_FAT_sectors; /* num of FAT sectors */ + + u32 root_dir; /* root dir cluster */ + u32 dentries_in_root; /* num of dentries in root dir */ + u32 dentries_per_clu; /* num of dentries per cluster */ + + u32 vol_flag; /* volume dirty flag */ + struct buffer_head *pbr_bh; /* PBR sector */ + + u32 map_clu; /* allocation bitmap start cluster */ + u32 map_sectors; /* num of allocation bitmap sectors */ + struct buffer_head **vol_amap; /* allocation bitmap */ + + u16 **vol_utbl; /* upcase table */ + + u32 clu_srch_ptr; /* cluster search pointer */ + u32 used_clusters; /* number of used clusters */ + UENTRY_T hint_uentry; /* unused entry hint information */ + + u32 dev_ejected; /* block device operation error flag */ + + FS_FUNC_T *fs_func; + struct semaphore v_sem; + + /* FAT cache */ + BUF_CACHE_T FAT_cache_array[FAT_CACHE_SIZE]; + BUF_CACHE_T FAT_cache_lru_list; + BUF_CACHE_T FAT_cache_hash_list[FAT_CACHE_HASH_SIZE]; + + /* buf cache */ + BUF_CACHE_T buf_cache_array[BUF_CACHE_SIZE]; + BUF_CACHE_T buf_cache_lru_list; + BUF_CACHE_T buf_cache_hash_list[BUF_CACHE_HASH_SIZE]; +} FS_INFO_T; + +#define ES_2_ENTRIES 2 +#define ES_3_ENTRIES 3 +#define ES_ALL_ENTRIES 0 + +typedef struct { + u32 sector; /* sector number that contains file_entry */ + s32 offset; /* byte offset in the sector */ + s32 alloc_flag; /* flag in stream entry. 01 for cluster chain, 03 for contig. clusteres. */ + u32 num_entries; + + /* __buf should be the last member */ + void *__buf; +} ENTRY_SET_CACHE_T; + +/*----------------------------------------------------------------------*/ +/* External Function Declarations */ +/*----------------------------------------------------------------------*/ + +/* file system initialization & shutdown functions */ +s32 ffsInit(void); +s32 ffsShutdown(void); + +/* volume management functions */ +s32 ffsMountVol(struct super_block *sb); +s32 ffsUmountVol(struct super_block *sb); +s32 ffsCheckVol(struct super_block *sb); +s32 ffsGetVolInfo(struct super_block *sb, VOL_INFO_T *info); +s32 ffsSyncVol(struct super_block *sb, s32 do_sync); + +/* file management functions */ +s32 ffsLookupFile(struct inode *inode, char *path, FILE_ID_T *fid); +s32 ffsCreateFile(struct inode *inode, char *path, u8 mode, FILE_ID_T *fid); +s32 ffsReadFile(struct inode *inode, FILE_ID_T *fid, void *buffer, u64 count, u64 *rcount); +s32 ffsWriteFile(struct inode *inode, FILE_ID_T *fid, void *buffer, u64 count, u64 *wcount); +s32 ffsTruncateFile(struct inode *inode, u64 old_size, u64 new_size); +s32 ffsMoveFile(struct inode *old_parent_inode, FILE_ID_T *fid, struct inode *new_parent_inode, struct dentry *new_dentry); +s32 ffsRemoveFile(struct inode *inode, FILE_ID_T *fid); +s32 ffsSetAttr(struct inode *inode, u32 attr); +s32 ffsGetStat(struct inode *inode, DIR_ENTRY_T *info); +s32 ffsSetStat(struct inode *inode, DIR_ENTRY_T *info); +s32 ffsMapCluster(struct inode *inode, s32 clu_offset, u32 *clu); + +/* directory management functions */ +s32 ffsCreateDir(struct inode *inode, char *path, FILE_ID_T *fid); +s32 ffsReadDir(struct inode *inode, DIR_ENTRY_T *dir_ent); +s32 ffsRemoveDir(struct inode *inode, FILE_ID_T *fid); + +/*----------------------------------------------------------------------*/ +/* External Function Declarations (NOT TO UPPER LAYER) */ +/*----------------------------------------------------------------------*/ + +/* fs management functions */ +s32 fs_init(void); +s32 fs_shutdown(void); +void fs_set_vol_flags(struct super_block *sb, u32 new_flag); +void fs_sync(struct super_block *sb, s32 do_sync); +void fs_error(struct super_block *sb); + +/* cluster management functions */ +s32 clear_cluster(struct super_block *sb, u32 clu); +s32 fat_alloc_cluster(struct super_block *sb, s32 num_alloc, CHAIN_T *p_chain); +s32 exfat_alloc_cluster(struct super_block *sb, s32 num_alloc, CHAIN_T *p_chain); +void fat_free_cluster(struct super_block *sb, CHAIN_T *p_chain, s32 do_relse); +void exfat_free_cluster(struct super_block *sb, CHAIN_T *p_chain, s32 do_relse); +u32 find_last_cluster(struct super_block *sb, CHAIN_T *p_chain); +s32 count_num_clusters(struct super_block *sb, CHAIN_T *dir); +s32 fat_count_used_clusters(struct super_block *sb); +s32 exfat_count_used_clusters(struct super_block *sb); +void exfat_chain_cont_cluster(struct super_block *sb, u32 chain, s32 len); + +/* allocation bitmap management functions */ +s32 load_alloc_bitmap(struct super_block *sb); +void free_alloc_bitmap(struct super_block *sb); +s32 set_alloc_bitmap(struct super_block *sb, u32 clu); +s32 clr_alloc_bitmap(struct super_block *sb, u32 clu); +u32 test_alloc_bitmap(struct super_block *sb, u32 clu); +void sync_alloc_bitmap(struct super_block *sb); + +/* upcase table management functions */ +s32 load_upcase_table(struct super_block *sb); +void free_upcase_table(struct super_block *sb); + +/* dir entry management functions */ +u32 fat_get_entry_type(DENTRY_T *p_entry); +u32 exfat_get_entry_type(DENTRY_T *p_entry); +void fat_set_entry_type(DENTRY_T *p_entry, u32 type); +void exfat_set_entry_type(DENTRY_T *p_entry, u32 type); +u32 fat_get_entry_attr(DENTRY_T *p_entry); +u32 exfat_get_entry_attr(DENTRY_T *p_entry); +void fat_set_entry_attr(DENTRY_T *p_entry, u32 attr); +void exfat_set_entry_attr(DENTRY_T *p_entry, u32 attr); +u8 fat_get_entry_flag(DENTRY_T *p_entry); +u8 exfat_get_entry_flag(DENTRY_T *p_entry); +void fat_set_entry_flag(DENTRY_T *p_entry, u8 flag); +void exfat_set_entry_flag(DENTRY_T *p_entry, u8 flag); +u32 fat_get_entry_clu0(DENTRY_T *p_entry); +u32 exfat_get_entry_clu0(DENTRY_T *p_entry); +void fat_set_entry_clu0(DENTRY_T *p_entry, u32 start_clu); +void exfat_set_entry_clu0(DENTRY_T *p_entry, u32 start_clu); +u64 fat_get_entry_size(DENTRY_T *p_entry); +u64 exfat_get_entry_size(DENTRY_T *p_entry); +void fat_set_entry_size(DENTRY_T *p_entry, u64 size); +void exfat_set_entry_size(DENTRY_T *p_entry, u64 size); +void fat_get_entry_time(DENTRY_T *p_entry, TIMESTAMP_T *tp, u8 mode); +void exfat_get_entry_time(DENTRY_T *p_entry, TIMESTAMP_T *tp, u8 mode); +void fat_set_entry_time(DENTRY_T *p_entry, TIMESTAMP_T *tp, u8 mode); +void exfat_set_entry_time(DENTRY_T *p_entry, TIMESTAMP_T *tp, u8 mode); +s32 fat_init_dir_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u32 type, u32 start_clu, u64 size); +s32 exfat_init_dir_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u32 type, u32 start_clu, u64 size); +s32 fat_init_ext_dir_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, s32 num_entries, UNI_NAME_T *p_uniname, DOS_NAME_T *p_dosname); +s32 exfat_init_ext_dir_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, s32 num_entries, UNI_NAME_T *p_uniname, DOS_NAME_T *p_dosname); +void init_dos_entry(DOS_DENTRY_T *ep, u32 type, u32 start_clu); +void init_ext_entry(EXT_DENTRY_T *ep, s32 order, u8 chksum, u16 *uniname); +void init_file_entry(FILE_DENTRY_T *ep, u32 type); +void init_strm_entry(STRM_DENTRY_T *ep, u8 flags, u32 start_clu, u64 size); +void init_name_entry(NAME_DENTRY_T *ep, u16 *uniname); +void fat_delete_dir_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, s32 order, s32 num_entries); +void exfat_delete_dir_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, s32 order, s32 num_entries); + +s32 find_location(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u32 *sector, s32 *offset); +DENTRY_T *get_entry_with_sector(struct super_block *sb, u32 sector, s32 offset); +DENTRY_T *get_entry_in_dir(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u32 *sector); +ENTRY_SET_CACHE_T *get_entry_set_in_dir(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u32 type, DENTRY_T **file_ep); +void release_entry_set(ENTRY_SET_CACHE_T *es); +s32 write_whole_entry_set(struct super_block *sb, ENTRY_SET_CACHE_T *es); +s32 write_partial_entries_in_entry_set(struct super_block *sb, ENTRY_SET_CACHE_T *es, DENTRY_T *ep, u32 count); +s32 search_deleted_or_unused_entry(struct super_block *sb, CHAIN_T *p_dir, s32 num_entries); +s32 find_empty_entry(struct inode *inode, CHAIN_T *p_dir, s32 num_entries); +s32 fat_find_dir_entry(struct super_block *sb, CHAIN_T *p_dir, UNI_NAME_T *p_uniname, s32 num_entries, DOS_NAME_T *p_dosname, u32 type); +s32 exfat_find_dir_entry(struct super_block *sb, CHAIN_T *p_dir, UNI_NAME_T *p_uniname, s32 num_entries, DOS_NAME_T *p_dosname, u32 type); +s32 fat_count_ext_entries(struct super_block *sb, CHAIN_T *p_dir, s32 entry, DENTRY_T *p_entry); +s32 exfat_count_ext_entries(struct super_block *sb, CHAIN_T *p_dir, s32 entry, DENTRY_T *p_entry); +s32 count_dos_name_entries(struct super_block *sb, CHAIN_T *p_dir, u32 type); +void update_dir_checksum(struct super_block *sb, CHAIN_T *p_dir, s32 entry); +void update_dir_checksum_with_entry_set(struct super_block *sb, ENTRY_SET_CACHE_T *es); +bool is_dir_empty(struct super_block *sb, CHAIN_T *p_dir); + +/* name conversion functions */ +s32 get_num_entries_and_dos_name(struct super_block *sb, CHAIN_T *p_dir, UNI_NAME_T *p_uniname, s32 *entries, DOS_NAME_T *p_dosname); +void get_uni_name_from_dos_entry(struct super_block *sb, DOS_DENTRY_T *ep, UNI_NAME_T *p_uniname, u8 mode); +void fat_get_uni_name_from_ext_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u16 *uniname); +void exfat_get_uni_name_from_ext_entry(struct super_block *sb, CHAIN_T *p_dir, s32 entry, u16 *uniname); +s32 extract_uni_name_from_ext_entry(EXT_DENTRY_T *ep, u16 *uniname, s32 order); +s32 extract_uni_name_from_name_entry(NAME_DENTRY_T *ep, u16 *uniname, s32 order); +s32 fat_generate_dos_name(struct super_block *sb, CHAIN_T *p_dir, DOS_NAME_T *p_dosname); +void fat_attach_count_to_dos_name(u8 *dosname, s32 count); +s32 fat_calc_num_entries(UNI_NAME_T *p_uniname); +s32 exfat_calc_num_entries(UNI_NAME_T *p_uniname); +u8 calc_checksum_1byte(void *data, s32 len, u8 chksum); +u16 calc_checksum_2byte(void *data, s32 len, u16 chksum, s32 type); +u32 calc_checksum_4byte(void *data, s32 len, u32 chksum, s32 type); + +/* name resolution functions */ +s32 resolve_path(struct inode *inode, char *path, CHAIN_T *p_dir, UNI_NAME_T *p_uniname); +s32 resolve_name(u8 *name, u8 **arg); + +/* file operation functions */ +s32 fat16_mount(struct super_block *sb, PBR_SECTOR_T *p_pbr); +s32 fat32_mount(struct super_block *sb, PBR_SECTOR_T *p_pbr); +s32 exfat_mount(struct super_block *sb, PBR_SECTOR_T *p_pbr); +s32 create_dir(struct inode *inode, CHAIN_T *p_dir, UNI_NAME_T *p_uniname, FILE_ID_T *fid); +s32 create_file(struct inode *inode, CHAIN_T *p_dir, UNI_NAME_T *p_uniname, u8 mode, FILE_ID_T *fid); +void remove_file(struct inode *inode, CHAIN_T *p_dir, s32 entry); +s32 rename_file(struct inode *inode, CHAIN_T *p_dir, s32 old_entry, UNI_NAME_T *p_uniname, FILE_ID_T *fid); +s32 move_file(struct inode *inode, CHAIN_T *p_olddir, s32 oldentry, CHAIN_T *p_newdir, UNI_NAME_T *p_uniname, FILE_ID_T *fid); + +/* sector read/write functions */ +s32 sector_read(struct super_block *sb, u32 sec, struct buffer_head **bh, s32 read); +s32 sector_write(struct super_block *sb, u32 sec, struct buffer_head *bh, s32 sync); +s32 multi_sector_read(struct super_block *sb, u32 sec, struct buffer_head **bh, s32 num_secs, s32 read); +s32 multi_sector_write(struct super_block *sb, u32 sec, struct buffer_head *bh, s32 num_secs, s32 sync); + +#endif /* _EXFAT_H */ diff --git b/fs/exfat/exfat_data.c b/fs/exfat/exfat_data.c new file mode 100644 index 0000000..65da07a --- /dev/null +++ b/fs/exfat/exfat_data.c @@ -0,0 +1,77 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_data.c */ +/* PURPOSE : exFAT Configuable Data Definitions */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#include "exfat_config.h" +#include "exfat_data.h" +#include "exfat_oal.h" + +#include "exfat_blkdev.h" +#include "exfat_cache.h" +#include "exfat_nls.h" +#include "exfat_super.h" +#include "exfat_core.h" + +/*======================================================================*/ +/* */ +/* GLOBAL VARIABLE DEFINITIONS */ +/* */ +/*======================================================================*/ + +/*----------------------------------------------------------------------*/ +/* File Manager */ +/*----------------------------------------------------------------------*/ + +/*----------------------------------------------------------------------*/ +/* Buffer Manager */ +/*----------------------------------------------------------------------*/ + +/* FAT cache */ +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,36) +DECLARE_MUTEX(f_sem); +#else +DEFINE_SEMAPHORE(f_sem); +#endif +BUF_CACHE_T FAT_cache_array[FAT_CACHE_SIZE]; +BUF_CACHE_T FAT_cache_lru_list; +BUF_CACHE_T FAT_cache_hash_list[FAT_CACHE_HASH_SIZE]; + +/* buf cache */ +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,36) +DECLARE_MUTEX(b_sem); +#else +DEFINE_SEMAPHORE(b_sem); +#endif +BUF_CACHE_T buf_cache_array[BUF_CACHE_SIZE]; +BUF_CACHE_T buf_cache_lru_list; +BUF_CACHE_T buf_cache_hash_list[BUF_CACHE_HASH_SIZE]; diff --git b/fs/exfat/exfat_data.h b/fs/exfat/exfat_data.h new file mode 100644 index 0000000..53b0e39 --- /dev/null +++ b/fs/exfat/exfat_data.h @@ -0,0 +1,58 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_data.h */ +/* PURPOSE : Header File for exFAT Configuable Constants */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#ifndef _EXFAT_DATA_H +#define _EXFAT_DATA_H + +#include "exfat_config.h" + +/*======================================================================*/ +/* */ +/* FFS CONFIGURATIONS */ +/* (CHANGE THIS PART IF REQUIRED) */ +/* */ +/*======================================================================*/ + +/* max number of root directory entries in FAT12/16 */ +/* (should be an exponential value of 2) */ +#define MAX_DENTRY 512 + +/* cache size (in number of sectors) */ +/* (should be an exponential value of 2) */ +#define FAT_CACHE_SIZE 128 +#define FAT_CACHE_HASH_SIZE 64 +#define BUF_CACHE_SIZE 256 +#define BUF_CACHE_HASH_SIZE 64 + +#endif /* _EXFAT_DATA_H */ diff --git b/fs/exfat/exfat_nls.c b/fs/exfat/exfat_nls.c new file mode 100644 index 0000000..a48b3d0 --- /dev/null +++ b/fs/exfat/exfat_nls.c @@ -0,0 +1,448 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_nls.c */ +/* PURPOSE : exFAT NLS Manager */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#include "exfat_config.h" +#include "exfat_data.h" + +#include "exfat_nls.h" +#include "exfat_api.h" +#include "exfat_super.h" +#include "exfat_core.h" + +#include + +/*----------------------------------------------------------------------*/ +/* Global Variable Definitions */ +/*----------------------------------------------------------------------*/ + +/*----------------------------------------------------------------------*/ +/* Local Variable Definitions */ +/*----------------------------------------------------------------------*/ + +static u16 bad_dos_chars[] = { + /* + , ; = [ ] */ + 0x002B, 0x002C, 0x003B, 0x003D, 0x005B, 0x005D, + 0xFF0B, 0xFF0C, 0xFF1B, 0xFF1D, 0xFF3B, 0xFF3D, + 0 +}; + +static u16 bad_uni_chars[] = { + /* " * / : < > ? \ | */ + 0x0022, 0x002A, 0x002F, 0x003A, + 0x003C, 0x003E, 0x003F, 0x005C, 0x007C, + 0 +}; + +/*----------------------------------------------------------------------*/ +/* Local Function Declarations */ +/*----------------------------------------------------------------------*/ + +static s32 convert_uni_to_ch(struct nls_table *nls, u8 *ch, u16 uni, s32 *lossy); +static s32 convert_ch_to_uni(struct nls_table *nls, u16 *uni, u8 *ch, s32 *lossy); + +/*======================================================================*/ +/* Global Function Definitions */ +/*======================================================================*/ + +u16 nls_upper(struct super_block *sb, u16 a) +{ + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + + if (EXFAT_SB(sb)->options.casesensitive) + return a; + if (p_fs->vol_utbl != NULL && (p_fs->vol_utbl)[get_col_index(a)] != NULL) + return (p_fs->vol_utbl)[get_col_index(a)][get_row_index(a)]; + else + return a; +} + +u16 *nls_wstrchr(u16 *str, u16 wchar) +{ + while (*str) { + if (*(str++) == wchar) + return str; + } + + return 0; +} + +s32 nls_dosname_cmp(struct super_block *sb, u8 *a, u8 *b) +{ + return strncmp((void *) a, (void *) b, DOS_NAME_LENGTH); +} /* end of nls_dosname_cmp */ + +s32 nls_uniname_cmp(struct super_block *sb, u16 *a, u16 *b) +{ + int i; + + for (i = 0; i < MAX_NAME_LENGTH; i++, a++, b++) { + if (nls_upper(sb, *a) != nls_upper(sb, *b)) + return 1; + if (*a == 0x0) + return 0; + } + return 0; +} /* end of nls_uniname_cmp */ + +void nls_uniname_to_dosname(struct super_block *sb, DOS_NAME_T *p_dosname, UNI_NAME_T *p_uniname, s32 *p_lossy) +{ + int i, j, len, lossy = FALSE; + u8 buf[MAX_CHARSET_SIZE]; + u8 lower = 0, upper = 0; + u8 *dosname = p_dosname->name; + u16 *uniname = p_uniname->name; + u16 *p, *last_period; + struct nls_table *nls = EXFAT_SB(sb)->nls_disk; + + for (i = 0; i < DOS_NAME_LENGTH; i++) + *(dosname+i) = ' '; + + if (!nls_uniname_cmp(sb, uniname, (u16 *) UNI_CUR_DIR_NAME)) { + *(dosname) = '.'; + p_dosname->name_case = 0x0; + if (p_lossy != NULL) + *p_lossy = FALSE; + return; + } + + if (!nls_uniname_cmp(sb, uniname, (u16 *) UNI_PAR_DIR_NAME)) { + *(dosname) = '.'; + *(dosname+1) = '.'; + p_dosname->name_case = 0x0; + if (p_lossy != NULL) + *p_lossy = FALSE; + return; + } + + /* search for the last embedded period */ + last_period = NULL; + for (p = uniname; *p; p++) { + if (*p == (u16) '.') + last_period = p; + } + + i = 0; + while (i < DOS_NAME_LENGTH) { + if (i == 8) { + if (last_period == NULL) + break; + + if (uniname <= last_period) { + if (uniname < last_period) + lossy = TRUE; + uniname = last_period + 1; + } + } + + if (*uniname == (u16) '\0') { + break; + } else if (*uniname == (u16) ' ') { + lossy = TRUE; + } else if (*uniname == (u16) '.') { + if (uniname < last_period) + lossy = TRUE; + else + i = 8; + } else if (nls_wstrchr(bad_dos_chars, *uniname)) { + lossy = TRUE; + *(dosname+i) = '_'; + i++; + } else { + len = convert_uni_to_ch(nls, buf, *uniname, &lossy); + + if (len > 1) { + if ((i >= 8) && ((i+len) > DOS_NAME_LENGTH)) + break; + + if ((i < 8) && ((i+len) > 8)) { + i = 8; + continue; + } + + lower = 0xFF; + + for (j = 0; j < len; j++, i++) + *(dosname+i) = *(buf+j); + } else { /* len == 1 */ + if ((*buf >= 'a') && (*buf <= 'z')) { + *(dosname+i) = *buf - ('a' - 'A'); + + if (i < 8) + lower |= 0x08; + else + lower |= 0x10; + } else if ((*buf >= 'A') && (*buf <= 'Z')) { + *(dosname+i) = *buf; + + if (i < 8) + upper |= 0x08; + else + upper |= 0x10; + } else { + *(dosname+i) = *buf; + } + i++; + } + } + + uniname++; + } + + if (*dosname == 0xE5) + *dosname = 0x05; + + if (*uniname != 0x0) + lossy = TRUE; + + if (upper & lower) + p_dosname->name_case = 0xFF; + else + p_dosname->name_case = lower; + + if (p_lossy != NULL) + *p_lossy = lossy; +} /* end of nls_uniname_to_dosname */ + +void nls_dosname_to_uniname(struct super_block *sb, UNI_NAME_T *p_uniname, DOS_NAME_T *p_dosname) +{ + int i = 0, j, n = 0; + u8 buf[DOS_NAME_LENGTH+2]; + u8 *dosname = p_dosname->name; + u16 *uniname = p_uniname->name; + struct nls_table *nls = EXFAT_SB(sb)->nls_disk; + + if (*dosname == 0x05) { + *buf = 0xE5; + i++; + n++; + } + + for (; i < 8; i++, n++) { + if (*(dosname+i) == ' ') + break; + + if ((*(dosname+i) >= 'A') && (*(dosname+i) <= 'Z') && (p_dosname->name_case & 0x08)) + *(buf+n) = *(dosname+i) + ('a' - 'A'); + else + *(buf+n) = *(dosname+i); + } + if (*(dosname+8) != ' ') { + *(buf+n) = '.'; + n++; + } + + for (i = 8; i < DOS_NAME_LENGTH; i++, n++) { + if (*(dosname+i) == ' ') + break; + + if ((*(dosname+i) >= 'A') && (*(dosname+i) <= 'Z') && (p_dosname->name_case & 0x10)) + *(buf+n) = *(dosname+i) + ('a' - 'A'); + else + *(buf+n) = *(dosname+i); + } + *(buf+n) = '\0'; + + i = j = 0; + while (j < (MAX_NAME_LENGTH-1)) { + if (*(buf+i) == '\0') + break; + + i += convert_ch_to_uni(nls, uniname, (buf+i), NULL); + + uniname++; + j++; + } + + *uniname = (u16) '\0'; +} /* end of nls_dosname_to_uniname */ + +void nls_uniname_to_cstring(struct super_block *sb, u8 *p_cstring, UNI_NAME_T *p_uniname) +{ + int i, j, len; + u8 buf[MAX_CHARSET_SIZE]; + u16 *uniname = p_uniname->name; + struct nls_table *nls = EXFAT_SB(sb)->nls_io; + + if (nls == NULL) { + len = utf16s_to_utf8s(uniname, MAX_NAME_LENGTH, UTF16_HOST_ENDIAN, p_cstring, MAX_NAME_LENGTH); + p_cstring[len] = 0; + return; + } + + i = 0; + while (i < (MAX_NAME_LENGTH-1)) { + if (*uniname == (u16) '\0') + break; + + len = convert_uni_to_ch(nls, buf, *uniname, NULL); + + if (len > 1) { + for (j = 0; j < len; j++) + *p_cstring++ = (char) *(buf+j); + } else { /* len == 1 */ + *p_cstring++ = (char) *buf; + } + + uniname++; + i++; + } + + *p_cstring = '\0'; +} /* end of nls_uniname_to_cstring */ + +void nls_cstring_to_uniname(struct super_block *sb, UNI_NAME_T *p_uniname, u8 *p_cstring, s32 *p_lossy) +{ + int i, j, lossy = FALSE; + u8 *end_of_name; + u8 upname[MAX_NAME_LENGTH * 2]; + u16 *uniname = p_uniname->name; + struct nls_table *nls = EXFAT_SB(sb)->nls_io; + + + /* strip all trailing spaces */ + end_of_name = p_cstring + strlen((char *) p_cstring); + + while (*(--end_of_name) == ' ') { + if (end_of_name < p_cstring) + break; + } + *(++end_of_name) = '\0'; + + if (strcmp((char *) p_cstring, ".") && strcmp((char *) p_cstring, "..")) { + + /* strip all trailing periods */ + while (*(--end_of_name) == '.') { + if (end_of_name < p_cstring) + break; + } + *(++end_of_name) = '\0'; + } + + if (*p_cstring == '\0') + lossy = TRUE; + + if (nls == NULL) { +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,0,101) + i = utf8s_to_utf16s(p_cstring, MAX_NAME_LENGTH, uniname); +#else + i = utf8s_to_utf16s(p_cstring, MAX_NAME_LENGTH, UTF16_HOST_ENDIAN, uniname, MAX_NAME_LENGTH); +#endif + for (j = 0; j < i; j++) + SET16_A(upname + j * 2, nls_upper(sb, uniname[j])); + uniname[i] = '\0'; + } + else { + i = j = 0; + while (j < (MAX_NAME_LENGTH-1)) { + if (*(p_cstring+i) == '\0') + break; + + i += convert_ch_to_uni(nls, uniname, (u8 *)(p_cstring+i), &lossy); + + if ((*uniname < 0x0020) || nls_wstrchr(bad_uni_chars, *uniname)) + lossy = TRUE; + + SET16_A(upname + j * 2, nls_upper(sb, *uniname)); + + uniname++; + j++; + } + + if (*(p_cstring+i) != '\0') + lossy = TRUE; + *uniname = (u16) '\0'; + } + + p_uniname->name_len = j; + p_uniname->name_hash = calc_checksum_2byte((void *) upname, j<<1, 0, CS_DEFAULT); + + if (p_lossy != NULL) + *p_lossy = lossy; +} /* end of nls_cstring_to_uniname */ + +/*======================================================================*/ +/* Local Function Definitions */ +/*======================================================================*/ + +static s32 convert_ch_to_uni(struct nls_table *nls, u16 *uni, u8 *ch, s32 *lossy) +{ + int len; + + *uni = 0x0; + + if (ch[0] < 0x80) { + *uni = (u16) ch[0]; + return 1; + } + + len = nls->char2uni(ch, NLS_MAX_CHARSET_SIZE, uni); + if (len < 0) { + /* conversion failed */ + printk("%s: fail to use nls\n", __func__); + if (lossy != NULL) + *lossy = TRUE; + *uni = (u16) '_'; + if (!strcmp(nls->charset, "utf8")) + return 1; + else + return 2; + } + + return len; +} /* end of convert_ch_to_uni */ + +static s32 convert_uni_to_ch(struct nls_table *nls, u8 *ch, u16 uni, s32 *lossy) +{ + int len; + + ch[0] = 0x0; + + if (uni < 0x0080) { + ch[0] = (u8) uni; + return 1; + } + + len = nls->uni2char(uni, ch, NLS_MAX_CHARSET_SIZE); + if (len < 0) { + /* conversion failed */ + printk("%s: fail to use nls\n", __func__); + if (lossy != NULL) + *lossy = TRUE; + ch[0] = '_'; + return 1; + } + + return len; + +} /* end of convert_uni_to_ch */ diff --git b/fs/exfat/exfat_nls.h b/fs/exfat/exfat_nls.h new file mode 100644 index 0000000..bc516d7 --- /dev/null +++ b/fs/exfat/exfat_nls.h @@ -0,0 +1,91 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_nls.h */ +/* PURPOSE : Header File for exFAT NLS Manager */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#ifndef _EXFAT_NLS_H +#define _EXFAT_NLS_H + +#include +#include + +#include "exfat_config.h" +#include "exfat_api.h" + +/*----------------------------------------------------------------------*/ +/* Constant & Macro Definitions */ +/*----------------------------------------------------------------------*/ + +#define NUM_UPCASE 2918 + +#define DOS_CUR_DIR_NAME ". " +#define DOS_PAR_DIR_NAME ".. " + +#ifdef __LITTLE_ENDIAN +#define UNI_CUR_DIR_NAME ".\0" +#define UNI_PAR_DIR_NAME ".\0.\0" +#else +#define UNI_CUR_DIR_NAME "\0." +#define UNI_PAR_DIR_NAME "\0.\0." +#endif + +/*----------------------------------------------------------------------*/ +/* Type Definitions */ +/*----------------------------------------------------------------------*/ + +/* DOS name stucture */ +typedef struct { + u8 name[DOS_NAME_LENGTH]; + u8 name_case; +} DOS_NAME_T; + +/* unicode name stucture */ +typedef struct { + u16 name[MAX_NAME_LENGTH]; + u16 name_hash; + u8 name_len; +} UNI_NAME_T; + +/*----------------------------------------------------------------------*/ +/* External Function Declarations */ +/*----------------------------------------------------------------------*/ + +/* NLS management function */ +u16 nls_upper(struct super_block *sb, u16 a); +s32 nls_dosname_cmp(struct super_block *sb, u8 *a, u8 *b); +s32 nls_uniname_cmp(struct super_block *sb, u16 *a, u16 *b); +void nls_uniname_to_dosname(struct super_block *sb, DOS_NAME_T *p_dosname, UNI_NAME_T *p_uniname, s32 *p_lossy); +void nls_dosname_to_uniname(struct super_block *sb, UNI_NAME_T *p_uniname, DOS_NAME_T *p_dosname); +void nls_uniname_to_cstring(struct super_block *sb, u8 *p_cstring, UNI_NAME_T *p_uniname); +void nls_cstring_to_uniname(struct super_block *sb, UNI_NAME_T *p_uniname, u8 *p_cstring, s32 *p_lossy); + +#endif /* _EXFAT_NLS_H */ diff --git b/fs/exfat/exfat_oal.c b/fs/exfat/exfat_oal.c new file mode 100644 index 0000000..9b33998 --- /dev/null +++ b/fs/exfat/exfat_oal.c @@ -0,0 +1,190 @@ +/* Some of the source code in this file came from "linux/fs/fat/misc.c". */ +/* + * linux/fs/fat/misc.c + * + * Written 1992,1993 by Werner Almesberger + * 22/11/2000 - Fixed fat_date_unix2dos for dates earlier than 01/01/1980 + * and date_dos2unix for date==0 by Igor Zhbanov(bsg@uniyar.ac.ru) + */ + +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_oal.c */ +/* PURPOSE : exFAT OS Adaptation Layer */ +/* (Semaphore Functions & Real-Time Clock Functions) */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#include +#include + +#include "exfat_config.h" +#include "exfat_api.h" +#include "exfat_oal.h" + +/*======================================================================*/ +/* */ +/* SEMAPHORE FUNCTIONS */ +/* */ +/*======================================================================*/ + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,36) +DECLARE_MUTEX(z_sem); +#else +DEFINE_SEMAPHORE(z_sem); +#endif + +s32 sm_init(struct semaphore *sm) +{ + sema_init(sm, 1); + return 0; +} /* end of sm_init */ + +s32 sm_P(struct semaphore *sm) +{ + down(sm); + return 0; +} /* end of sm_P */ + +void sm_V(struct semaphore *sm) +{ + up(sm); +} /* end of sm_V */ + + +/*======================================================================*/ +/* */ +/* REAL-TIME CLOCK FUNCTIONS */ +/* */ +/*======================================================================*/ + +extern struct timezone sys_tz; + +/* + * The epoch of FAT timestamp is 1980. + * : bits : value + * date: 0 - 4: day (1 - 31) + * date: 5 - 8: month (1 - 12) + * date: 9 - 15: year (0 - 127) from 1980 + * time: 0 - 4: sec (0 - 29) 2sec counts + * time: 5 - 10: min (0 - 59) + * time: 11 - 15: hour (0 - 23) + */ +#define UNIX_SECS_1980 315532800L + +#if BITS_PER_LONG == 64 +#define UNIX_SECS_2108 4354819200L +#endif +/* days between 1.1.70 and 1.1.80 (2 leap days) */ +#define DAYS_DELTA_DECADE (365 * 10 + 2) +/* 120 (2100 - 1980) isn't leap year */ +#define NO_LEAP_YEAR_2100 (120) +#define IS_LEAP_YEAR(y) (!((y) & 3) && (y) != NO_LEAP_YEAR_2100) + +#define SECS_PER_MIN (60) +#define SECS_PER_HOUR (60 * SECS_PER_MIN) +#define SECS_PER_DAY (24 * SECS_PER_HOUR) + +#define MAKE_LEAP_YEAR(leap_year, year) \ + do { \ + if (unlikely(year > NO_LEAP_YEAR_2100)) \ + leap_year = ((year + 3) / 4) - 1; \ + else \ + leap_year = ((year + 3) / 4); \ + } while (0) + +/* Linear day numbers of the respective 1sts in non-leap years. */ +static time_t accum_days_in_year[] = { + /* Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec */ + 0, 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 0, 0, 0, +}; + +TIMESTAMP_T *tm_current(TIMESTAMP_T *tp) +{ + struct timespec ts = CURRENT_TIME_SEC; + time_t second = ts.tv_sec; + time_t day, leap_day, month, year; + + second -= sys_tz.tz_minuteswest * SECS_PER_MIN; + + /* Jan 1 GMT 00:00:00 1980. But what about another time zone? */ + if (second < UNIX_SECS_1980) { + tp->sec = 0; + tp->min = 0; + tp->hour = 0; + tp->day = 1; + tp->mon = 1; + tp->year = 0; + return tp; + } +#if BITS_PER_LONG == 64 + if (second >= UNIX_SECS_2108) { + tp->sec = 59; + tp->min = 59; + tp->hour = 23; + tp->day = 31; + tp->mon = 12; + tp->year = 127; + return tp; + } +#endif + + day = second / SECS_PER_DAY - DAYS_DELTA_DECADE; + year = day / 365; + + MAKE_LEAP_YEAR(leap_day, year); + if (year * 365 + leap_day > day) + year--; + + MAKE_LEAP_YEAR(leap_day, year); + + day -= year * 365 + leap_day; + + if (IS_LEAP_YEAR(year) && day == accum_days_in_year[3]) { + month = 2; + } else { + if (IS_LEAP_YEAR(year) && day > accum_days_in_year[3]) + day--; + for (month = 1; month < 12; month++) { + if (accum_days_in_year[month + 1] > day) + break; + } + } + day -= accum_days_in_year[month]; + + tp->sec = second % SECS_PER_MIN; + tp->min = (second / SECS_PER_MIN) % 60; + tp->hour = (second / SECS_PER_HOUR) % 24; + tp->day = day + 1; + tp->mon = month; + tp->year = year; + + return tp; +} /* end of tm_current */ diff --git b/fs/exfat/exfat_oal.h b/fs/exfat/exfat_oal.h new file mode 100644 index 0000000..b6dd789 --- /dev/null +++ b/fs/exfat/exfat_oal.h @@ -0,0 +1,74 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_oal.h */ +/* PURPOSE : Header File for exFAT OS Adaptation Layer */ +/* (Semaphore Functions & Real-Time Clock Functions) */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#ifndef _EXFAT_OAL_H +#define _EXFAT_OAL_H + +#include +#include "exfat_config.h" +#include + +/*----------------------------------------------------------------------*/ +/* Constant & Macro Definitions (Configurable) */ +/*----------------------------------------------------------------------*/ + +/*----------------------------------------------------------------------*/ +/* Constant & Macro Definitions (Non-Configurable) */ +/*----------------------------------------------------------------------*/ + +/*----------------------------------------------------------------------*/ +/* Type Definitions */ +/*----------------------------------------------------------------------*/ + +typedef struct { + u16 sec; /* 0 ~ 59 */ + u16 min; /* 0 ~ 59 */ + u16 hour; /* 0 ~ 23 */ + u16 day; /* 1 ~ 31 */ + u16 mon; /* 1 ~ 12 */ + u16 year; /* 0 ~ 127 (since 1980) */ +} TIMESTAMP_T; + +/*----------------------------------------------------------------------*/ +/* External Function Declarations */ +/*----------------------------------------------------------------------*/ + +s32 sm_init(struct semaphore *sm); +s32 sm_P(struct semaphore *sm); +void sm_V(struct semaphore *sm); + +TIMESTAMP_T *tm_current(TIMESTAMP_T *tm); + +#endif /* _EXFAT_OAL_H */ diff --git b/fs/exfat/exfat_super.c b/fs/exfat/exfat_super.c new file mode 100644 index 0000000..a003a72 --- /dev/null +++ b/fs/exfat/exfat_super.c @@ -0,0 +1,2600 @@ +/* Some of the source code in this file came from "linux/fs/fat/file.c","linux/fs/fat/inode.c" and "linux/fs/fat/misc.c". */ +/* + * linux/fs/fat/file.c + * + * Written 1992,1993 by Werner Almesberger + * + * regular file handling primitives for fat-based filesystems + */ + +/* + * linux/fs/fat/inode.c + * + * Written 1992,1993 by Werner Almesberger + * VFAT extensions by Gordon Chaffee, merged with msdos fs by Henrik Storner + * Rewritten for the constant inumbers support by Al Viro + * + * Fixes: + * + * Max Cohan: Fixed invalid FSINFO offset when info_sector is 0 + */ + +/* + * linux/fs/fat/misc.c + * + * Written 1992,1993 by Werner Almesberger + * 22/11/2000 - Fixed fat_date_unix2dos for dates earlier than 01/01/1980 + * and date_dos2unix for date==0 by Igor Zhbanov(bsg@uniyar.ac.ru) + */ + +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +#include +#include +#include +#include +#include +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,37) +#include +#endif +#include +#include +#include +#include +#include +#include +#include +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0) +#include +#endif +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "exfat_version.h" +#include "exfat_config.h" +#include "exfat_data.h" +#include "exfat_oal.h" + +#include "exfat_blkdev.h" +#include "exfat_cache.h" +#include "exfat_nls.h" +#include "exfat_api.h" +#include "exfat_core.h" + +#include "exfat_super.h" + +static struct kmem_cache *exfat_inode_cachep; + +static int exfat_default_codepage = CONFIG_EXFAT_DEFAULT_CODEPAGE; +static char exfat_default_iocharset[] = CONFIG_EXFAT_DEFAULT_IOCHARSET; + +extern struct timezone sys_tz; + +#define CHECK_ERR(x) BUG_ON(x) + +#define UNIX_SECS_1980 315532800L + +#if BITS_PER_LONG == 64 +#define UNIX_SECS_2108 4354819200L +#endif +/* days between 1.1.70 and 1.1.80 (2 leap days) */ +#define DAYS_DELTA_DECADE (365 * 10 + 2) +/* 120 (2100 - 1980) isn't leap year */ +#define NO_LEAP_YEAR_2100 (120) +#define IS_LEAP_YEAR(y) (!((y) & 0x3) && (y) != NO_LEAP_YEAR_2100) + +#define SECS_PER_MIN (60) +#define SECS_PER_HOUR (60 * SECS_PER_MIN) +#define SECS_PER_DAY (24 * SECS_PER_HOUR) + +#define MAKE_LEAP_YEAR(leap_year, year) \ + do { \ + if (unlikely(year > NO_LEAP_YEAR_2100)) \ + leap_year = ((year + 3) / 4) - 1; \ + else \ + leap_year = ((year + 3) / 4); \ + } while (0) + +/* Linear day numbers of the respective 1sts in non-leap years. */ +static time_t accum_days_in_year[] = { + /* Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec */ + 0, 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 0, 0, 0, +}; + +static void _exfat_truncate(struct inode *inode, loff_t old_size); + +/* Convert a FAT time/date pair to a UNIX date (seconds since 1 1 70). */ +void exfat_time_fat2unix(struct exfat_sb_info *sbi, struct timespec *ts, + DATE_TIME_T *tp) +{ + time_t year = tp->Year; + time_t ld; + + MAKE_LEAP_YEAR(ld, year); + + if (IS_LEAP_YEAR(year) && (tp->Month) > 2) + ld++; + + ts->tv_sec = tp->Second + tp->Minute * SECS_PER_MIN + + tp->Hour * SECS_PER_HOUR + + (year * 365 + ld + accum_days_in_year[(tp->Month)] + (tp->Day - 1) + DAYS_DELTA_DECADE) * SECS_PER_DAY + + sys_tz.tz_minuteswest * SECS_PER_MIN; + ts->tv_nsec = 0; +} + +/* Convert linear UNIX date to a FAT time/date pair. */ +void exfat_time_unix2fat(struct exfat_sb_info *sbi, struct timespec *ts, + DATE_TIME_T *tp) +{ + time_t second = ts->tv_sec; + time_t day, month, year; + time_t ld; + + second -= sys_tz.tz_minuteswest * SECS_PER_MIN; + + /* Jan 1 GMT 00:00:00 1980. But what about another time zone? */ + if (second < UNIX_SECS_1980) { + tp->Second = 0; + tp->Minute = 0; + tp->Hour = 0; + tp->Day = 1; + tp->Month = 1; + tp->Year = 0; + return; + } +#if (BITS_PER_LONG == 64) + if (second >= UNIX_SECS_2108) { + tp->Second = 59; + tp->Minute = 59; + tp->Hour = 23; + tp->Day = 31; + tp->Month = 12; + tp->Year = 127; + return; + } +#endif + day = second / SECS_PER_DAY - DAYS_DELTA_DECADE; + year = day / 365; + MAKE_LEAP_YEAR(ld, year); + if (year * 365 + ld > day) + year--; + + MAKE_LEAP_YEAR(ld, year); + day -= year * 365 + ld; + + if (IS_LEAP_YEAR(year) && day == accum_days_in_year[3]) { + month = 2; + } else { + if (IS_LEAP_YEAR(year) && day > accum_days_in_year[3]) + day--; + for (month = 1; month < 12; month++) { + if (accum_days_in_year[month + 1] > day) + break; + } + } + day -= accum_days_in_year[month]; + + tp->Second = second % SECS_PER_MIN; + tp->Minute = (second / SECS_PER_MIN) % 60; + tp->Hour = (second / SECS_PER_HOUR) % 24; + tp->Day = day + 1; + tp->Month = month; + tp->Year = year; +} + +static struct inode *exfat_iget(struct super_block *sb, loff_t i_pos); +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,36) +static int exfat_generic_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg); +#else +static long exfat_generic_ioctl(struct file *filp, unsigned int cmd, unsigned long arg); +#endif +static int exfat_sync_inode(struct inode *inode); +static struct inode *exfat_build_inode(struct super_block *sb, FILE_ID_T *fid, loff_t i_pos); +static void exfat_detach(struct inode *inode); +static void exfat_attach(struct inode *inode, loff_t i_pos); +static inline unsigned long exfat_hash(loff_t i_pos); +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,34) +static int exfat_write_inode(struct inode *inode, int wait); +#else +static int exfat_write_inode(struct inode *inode, struct writeback_control *wbc); +#endif +static void exfat_write_super(struct super_block *sb); + +static void __lock_super(struct super_block *sb) +{ +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,7,0) + lock_super(sb); +#else + struct exfat_sb_info *sbi = EXFAT_SB(sb); + mutex_lock(&sbi->s_lock); +#endif +} + +static void __unlock_super(struct super_block *sb) +{ +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,7,0) + unlock_super(sb); +#else + struct exfat_sb_info *sbi = EXFAT_SB(sb); + mutex_unlock(&sbi->s_lock); +#endif +} + +static int __is_sb_dirty(struct super_block *sb) +{ +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,7,0) + return sb->s_dirt; +#else + struct exfat_sb_info *sbi = EXFAT_SB(sb); + return sbi->s_dirt; +#endif +} + +static void __set_sb_clean(struct super_block *sb) +{ +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,7,0) + sb->s_dirt = 0; +#else + struct exfat_sb_info *sbi = EXFAT_SB(sb); + sbi->s_dirt = 0; +#endif +} + +static int __exfat_revalidate(struct dentry *dentry) +{ + return 0; +} + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,7,00) +static int exfat_revalidate(struct dentry *dentry, unsigned int flags) +#else +static int exfat_revalidate(struct dentry *dentry, struct nameidata *nd) +#endif +{ +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,7,00) + if (flags & LOOKUP_RCU) + return -ECHILD; +#elif LINUX_VERSION_CODE >= KERNEL_VERSION(3,0,00) + if (nd && nd->flags & LOOKUP_RCU) + return -ECHILD; +#endif + + if (dentry->d_inode) + return 1; + return __exfat_revalidate(dentry); +} + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,7,00) +static int exfat_revalidate_ci(struct dentry *dentry, unsigned int flags) +#else +static int exfat_revalidate_ci(struct dentry *dentry, struct nameidata *nd) +#endif +{ +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,7,00) + if (flags & LOOKUP_RCU) + return -ECHILD; +#else + unsigned int flags; + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,0,00) + if (nd && nd->flags & LOOKUP_RCU) + return -ECHILD; +#endif + + flags = nd ? nd->flags : 0; +#endif + + if (dentry->d_inode) + return 1; + + if (!flags) + return 0; + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,0,00) + if (flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET)) + return 0; +#else + if (!(nd->flags & (LOOKUP_CONTINUE | LOOKUP_PARENT))) { + if (nd->flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET)) + return 0; + } +#endif + + return __exfat_revalidate(dentry); +} + +static unsigned int __exfat_striptail_len(unsigned int len, const char *name) +{ + while (len && name[len - 1] == '.') + len--; + return len; +} + +static unsigned int exfat_striptail_len(const struct qstr *qstr) +{ + return __exfat_striptail_len(qstr->len, qstr->name); +} + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) +static int exfat_d_hash(const struct dentry *dentry, struct qstr *qstr) +#elif LINUX_VERSION_CODE < KERNEL_VERSION(2,6,38) +static int exfat_d_hash(struct dentry *dentry, struct qstr *qstr) +#else +static int exfat_d_hash(const struct dentry *dentry, const struct inode *inode, + struct qstr *qstr) +#endif +{ + qstr->hash = full_name_hash(qstr->name, exfat_striptail_len(qstr)); + return 0; +} + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) +static int exfat_d_hashi(const struct dentry *dentry, struct qstr *qstr) +#elif LINUX_VERSION_CODE < KERNEL_VERSION(2,6,38) +static int exfat_d_hashi(struct dentry *dentry, struct qstr *qstr) +#else +static int exfat_d_hashi(const struct dentry *dentry, const struct inode *inode, + struct qstr *qstr) +#endif +{ + struct super_block *sb = dentry->d_sb; + const unsigned char *name; + unsigned int len; + unsigned long hash; + + name = qstr->name; + len = exfat_striptail_len(qstr); + + hash = init_name_hash(); + while (len--) + hash = partial_name_hash(nls_upper(sb, *name++), hash); + qstr->hash = end_name_hash(hash); + + return 0; +} + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) +static int exfat_cmpi(const struct dentry *parent, const struct dentry *dentry, + unsigned int len, const char *str, const struct qstr *name) +#elif LINUX_VERSION_CODE < KERNEL_VERSION(2,6,38) +static int exfat_cmpi(struct dentry *parent, struct qstr *a, struct qstr *b) +#else +static int exfat_cmpi(const struct dentry *parent, const struct inode *pinode, + const struct dentry *dentry, const struct inode *inode, + unsigned int len, const char *str, const struct qstr *name) +#endif +{ + struct nls_table *t = EXFAT_SB(parent->d_sb)->nls_io; + unsigned int alen, blen; + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,38) + alen = exfat_striptail_len(a); + blen = exfat_striptail_len(b); +#else + alen = exfat_striptail_len(name); + blen = __exfat_striptail_len(len, str); +#endif + if (alen == blen) { + if (t == NULL) { +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,38) + if (strncasecmp(a->name, b->name, alen) == 0) +#else + if (strncasecmp(name->name, str, alen) == 0) +#endif + return 0; +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,38) + } else if (nls_strnicmp(t, a->name, b->name, alen) == 0) +#else + } else if (nls_strnicmp(t, name->name, str, alen) == 0) +#endif + return 0; + } + return 1; +} + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) +static int exfat_cmp(const struct dentry *parent, const struct dentry *dentry, + unsigned int len, const char *str, const struct qstr *name) +#elif LINUX_VERSION_CODE < KERNEL_VERSION(2,6,38) +static int exfat_cmp(struct dentry *parent, struct qstr *a, + struct qstr *b) +#else +static int exfat_cmp(const struct dentry *parent, const struct inode *pinode, + const struct dentry *dentry, const struct inode *inode, + unsigned int len, const char *str, const struct qstr *name) +#endif +{ + unsigned int alen, blen; + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,38) + alen = exfat_striptail_len(a); + blen = exfat_striptail_len(b); +#else + alen = exfat_striptail_len(name); + blen = __exfat_striptail_len(len, str); +#endif + if (alen == blen) { +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,38) + if (strncmp(a->name, b->name, alen) == 0) +#else + if (strncmp(name->name, str, alen) == 0) +#endif + return 0; + } + return 1; +} + +static const struct dentry_operations exfat_ci_dentry_ops = { + .d_revalidate = exfat_revalidate_ci, + .d_hash = exfat_d_hashi, + .d_compare = exfat_cmpi, +}; + +static const struct dentry_operations exfat_dentry_ops = { + .d_revalidate = exfat_revalidate, + .d_hash = exfat_d_hash, + .d_compare = exfat_cmp, +}; + +/*======================================================================*/ +/* Directory Entry Operations */ +/*======================================================================*/ + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) +static int exfat_readdir(struct file *filp, struct dir_context *ctx) +#else +static int exfat_readdir(struct file *filp, void *dirent, filldir_t filldir) +#endif +{ +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,9,0) + struct inode *inode = file_inode(filp); +#else + struct inode *inode = filp->f_path.dentry->d_inode; +#endif + struct super_block *sb = inode->i_sb; + struct exfat_sb_info *sbi = EXFAT_SB(sb); + FS_INFO_T *p_fs = &(sbi->fs_info); + BD_INFO_T *p_bd = &(EXFAT_SB(sb)->bd_info); + DIR_ENTRY_T de; + unsigned long inum; + loff_t cpos; + int err = 0; + + __lock_super(sb); + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) + cpos = ctx->pos; +#else + cpos = filp->f_pos; +#endif + /* Fake . and .. for the root directory. */ + if ((p_fs->vol_type == EXFAT) || (inode->i_ino == EXFAT_ROOT_INO)) { + while (cpos < 2) { + if (inode->i_ino == EXFAT_ROOT_INO) + inum = EXFAT_ROOT_INO; + else if (cpos == 0) + inum = inode->i_ino; + else /* (cpos == 1) */ + inum = parent_ino(filp->f_path.dentry); + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) + if (!dir_emit_dots(filp, ctx)) +#else + if (filldir(dirent, "..", cpos+1, cpos, inum, DT_DIR) < 0) +#endif + goto out; + cpos++; +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) + ctx->pos++; +#else + filp->f_pos++; +#endif + } + if (cpos == 2) + cpos = 0; + } + if (cpos & (DENTRY_SIZE - 1)) { + err = -ENOENT; + goto out; + } + +get_new: + EXFAT_I(inode)->fid.size = i_size_read(inode); + EXFAT_I(inode)->fid.rwoffset = cpos >> DENTRY_SIZE_BITS; + + err = FsReadDir(inode, &de); + if (err) { + /* at least we tried to read a sector + * move cpos to next sector position (should be aligned) + */ + if (err == FFS_MEDIAERR) { + cpos += 1 << p_bd->sector_size_bits; + cpos &= ~((1 << p_bd->sector_size_bits)-1); + } + + err = -EIO; + goto end_of_dir; + } + + cpos = EXFAT_I(inode)->fid.rwoffset << DENTRY_SIZE_BITS; + + if (!de.Name[0]) + goto end_of_dir; + + if (!memcmp(de.ShortName, DOS_CUR_DIR_NAME, DOS_NAME_LENGTH)) { + inum = inode->i_ino; + } else if (!memcmp(de.ShortName, DOS_PAR_DIR_NAME, DOS_NAME_LENGTH)) { + inum = parent_ino(filp->f_path.dentry); + } else { + loff_t i_pos = ((loff_t) EXFAT_I(inode)->fid.start_clu << 32) | + ((EXFAT_I(inode)->fid.rwoffset-1) & 0xffffffff); + + struct inode *tmp = exfat_iget(sb, i_pos); + if (tmp) { + inum = tmp->i_ino; + iput(tmp); + } else { + inum = iunique(sb, EXFAT_ROOT_INO); + } + } + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) + if (!dir_emit(ctx, de.Name, strlen(de.Name), inum, + (de.Attr & ATTR_SUBDIR) ? DT_DIR : DT_REG)) +#else + if (filldir(dirent, de.Name, strlen(de.Name), cpos-1, inum, + (de.Attr & ATTR_SUBDIR) ? DT_DIR : DT_REG) < 0) +#endif + goto out; + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) + ctx->pos = cpos; +#else + filp->f_pos = cpos; +#endif + goto get_new; + +end_of_dir: +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) + ctx->pos = cpos; +#else + filp->f_pos = cpos; +#endif +out: + __unlock_super(sb); + return err; +} + +static int exfat_ioctl_volume_id(struct inode *dir) +{ + struct super_block *sb = dir->i_sb; + struct exfat_sb_info *sbi = EXFAT_SB(sb); + FS_INFO_T *p_fs = &(sbi->fs_info); + + return p_fs->vol_id; +} + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,36) +static int exfat_generic_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +#else +static long exfat_generic_ioctl(struct file *filp, + unsigned int cmd, unsigned long arg) +#endif +{ +#if !(LINUX_VERSION_CODE < KERNEL_VERSION(2,6,36)) + #if !(LINUX_VERSION_CODE < KERNEL_VERSION(3,18,3)) + struct inode *inode = filp->f_path.dentry->d_inode; + #else + struct inode *inode = filp->f_dentry->d_inode; + #endif +#endif +#ifdef CONFIG_EXFAT_KERNEL_DEBUG + unsigned int flags; +#endif /* CONFIG_EXFAT_KERNEL_DEBUG */ + + switch (cmd) { + case EXFAT_IOCTL_GET_VOLUME_ID: + return exfat_ioctl_volume_id(inode); +#ifdef CONFIG_EXFAT_KERNEL_DEBUG + case EXFAT_IOC_GET_DEBUGFLAGS: { + struct super_block *sb = inode->i_sb; + struct exfat_sb_info *sbi = EXFAT_SB(sb); + + flags = sbi->debug_flags; + return put_user(flags, (int __user *)arg); + } + case EXFAT_IOC_SET_DEBUGFLAGS: { + struct super_block *sb = inode->i_sb; + struct exfat_sb_info *sbi = EXFAT_SB(sb); + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (get_user(flags, (int __user *) arg)) + return -EFAULT; + + __lock_super(sb); + sbi->debug_flags = flags; + __unlock_super(sb); + + return 0; + } +#endif /* CONFIG_EXFAT_KERNEL_DEBUG */ + default: + return -ENOTTY; /* Inappropriate ioctl for device */ + } +} + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,36) +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,35) +static int exfat_file_fsync(struct file *filp, struct dentry *dentry, + int datasync) +#else +static int exfat_file_fsync(struct file *filp, int datasync) +#endif +{ + struct inode *inode = filp->f_mapping->host; + struct super_block *sb = inode->i_sb; + int res, err; + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,35) + res = simple_fsync(filp, dentry, datasync); +#else + res = generic_file_fsync(filp, datasync); +#endif + err = FsSyncVol(sb, 1); + + return res ? res : err; +} +#endif + +const struct file_operations exfat_dir_operations = { + .llseek = generic_file_llseek, + .read = generic_read_dir, +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) + .iterate = exfat_readdir, +#else + .readdir = exfat_readdir, +#endif +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,36) + .ioctl = exfat_generic_ioctl, + .fsync = exfat_file_fsync, +#else + .unlocked_ioctl = exfat_generic_ioctl, + .fsync = generic_file_fsync, +#endif +}; + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,7,00) +static int exfat_create(struct inode *dir, struct dentry *dentry, umode_t mode, + bool excl) +#elif LINUX_VERSION_CODE >= KERNEL_VERSION(3,3,0) +static int exfat_create(struct inode *dir, struct dentry *dentry, umode_t mode, + struct nameidata *nd) +#else +static int exfat_create(struct inode *dir, struct dentry *dentry, int mode, + struct nameidata *nd) +#endif +{ + struct super_block *sb = dir->i_sb; + struct inode *inode; + struct timespec ts; + FILE_ID_T fid; + loff_t i_pos; + int err; + + __lock_super(sb); + + DPRINTK("exfat_create entered\n"); + + ts = CURRENT_TIME_SEC; + + err = FsCreateFile(dir, (u8 *) dentry->d_name.name, FM_REGULAR, &fid); + if (err) { + if (err == FFS_INVALIDPATH) + err = -EINVAL; + else if (err == FFS_FILEEXIST) + err = -EEXIST; + else if (err == FFS_FULL) + err = -ENOSPC; + else if (err == FFS_NAMETOOLONG) + err = -ENAMETOOLONG; + else + err = -EIO; + goto out; + } + dir->i_version++; + dir->i_ctime = dir->i_mtime = dir->i_atime = ts; + if (IS_DIRSYNC(dir)) + (void) exfat_sync_inode(dir); + else + mark_inode_dirty(dir); + + i_pos = ((loff_t) fid.dir.dir << 32) | (fid.entry & 0xffffffff); + + inode = exfat_build_inode(sb, &fid, i_pos); + if (IS_ERR(inode)) { + err = PTR_ERR(inode); + goto out; + } + inode->i_version++; + inode->i_mtime = inode->i_atime = inode->i_ctime = ts; + /* timestamp is already written, so mark_inode_dirty() is unnecessary. */ + + dentry->d_time = dentry->d_parent->d_inode->i_version; + d_instantiate(dentry, inode); + +out: + __unlock_super(sb); + DPRINTK("exfat_create exited\n"); + return err; +} + +static int exfat_find(struct inode *dir, struct qstr *qname, + FILE_ID_T *fid) +{ + int err; + + if (qname->len == 0) + return -ENOENT; + + err = FsLookupFile(dir, (u8 *) qname->name, fid); + if (err) + return -ENOENT; + + return 0; +} + +static int exfat_d_anon_disconn(struct dentry *dentry) +{ + return IS_ROOT(dentry) && (dentry->d_flags & DCACHE_DISCONNECTED); +} + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,7,00) +static struct dentry *exfat_lookup(struct inode *dir, struct dentry *dentry, + unsigned int flags) +#else +static struct dentry *exfat_lookup(struct inode *dir, struct dentry *dentry, + struct nameidata *nd) +#endif +{ + struct super_block *sb = dir->i_sb; + struct inode *inode; + struct dentry *alias; + int err; + FILE_ID_T fid; + loff_t i_pos; + u64 ret; + mode_t i_mode; + + __lock_super(sb); + DPRINTK("exfat_lookup entered\n"); + err = exfat_find(dir, &dentry->d_name, &fid); + if (err) { + if (err == -ENOENT) { + inode = NULL; + goto out; + } + goto error; + } + + i_pos = ((loff_t) fid.dir.dir << 32) | (fid.entry & 0xffffffff); + inode = exfat_build_inode(sb, &fid, i_pos); + if (IS_ERR(inode)) { + err = PTR_ERR(inode); + goto error; + } + + i_mode = inode->i_mode; + if (S_ISLNK(i_mode)) { + EXFAT_I(inode)->target = kmalloc(i_size_read(inode)+1, GFP_KERNEL); + if (!EXFAT_I(inode)->target) { + err = -ENOMEM; + goto error; + } + FsReadFile(dir, &fid, EXFAT_I(inode)->target, i_size_read(inode), &ret); + *(EXFAT_I(inode)->target + i_size_read(inode)) = '\0'; + } + + alias = d_find_alias(inode); + if (alias && !exfat_d_anon_disconn(alias)) { + CHECK_ERR(d_unhashed(alias)); + if (!S_ISDIR(i_mode)) + d_move(alias, dentry); + iput(inode); + __unlock_super(sb); + DPRINTK("exfat_lookup exited 1\n"); + return alias; + } else { + dput(alias); + } +out: + __unlock_super(sb); + dentry->d_time = dentry->d_parent->d_inode->i_version; +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,38) + dentry->d_op = sb->s_root->d_op; + dentry = d_splice_alias(inode, dentry); + if (dentry) { + dentry->d_op = sb->s_root->d_op; + dentry->d_time = dentry->d_parent->d_inode->i_version; + } +#else + dentry = d_splice_alias(inode, dentry); + if (dentry) + dentry->d_time = dentry->d_parent->d_inode->i_version; +#endif + DPRINTK("exfat_lookup exited 2\n"); + return dentry; + +error: + __unlock_super(sb); + DPRINTK("exfat_lookup exited 3\n"); + return ERR_PTR(err); +} + +static int exfat_unlink(struct inode *dir, struct dentry *dentry) +{ + struct inode *inode = dentry->d_inode; + struct super_block *sb = dir->i_sb; + struct timespec ts; + int err; + + __lock_super(sb); + + DPRINTK("exfat_unlink entered\n"); + + ts = CURRENT_TIME_SEC; + + EXFAT_I(inode)->fid.size = i_size_read(inode); + + err = FsRemoveFile(dir, &(EXFAT_I(inode)->fid)); + if (err) { + if (err == FFS_PERMISSIONERR) + err = -EPERM; + else + err = -EIO; + goto out; + } + dir->i_version++; + dir->i_mtime = dir->i_atime = ts; + if (IS_DIRSYNC(dir)) + (void) exfat_sync_inode(dir); + else + mark_inode_dirty(dir); + + clear_nlink(inode); + inode->i_mtime = inode->i_atime = ts; + exfat_detach(inode); + remove_inode_hash(inode); + +out: + __unlock_super(sb); + DPRINTK("exfat_unlink exited\n"); + return err; +} + +static int exfat_symlink(struct inode *dir, struct dentry *dentry, const char *target) +{ + struct super_block *sb = dir->i_sb; + struct inode *inode; + struct timespec ts; + FILE_ID_T fid; + loff_t i_pos; + int err; + u64 len = (u64) strlen(target); + u64 ret; + + __lock_super(sb); + + DPRINTK("exfat_symlink entered\n"); + + ts = CURRENT_TIME_SEC; + + err = FsCreateFile(dir, (u8 *) dentry->d_name.name, FM_SYMLINK, &fid); + if (err) { + if (err == FFS_INVALIDPATH) + err = -EINVAL; + else if (err == FFS_FILEEXIST) + err = -EEXIST; + else if (err == FFS_FULL) + err = -ENOSPC; + else + err = -EIO; + goto out; + } + + err = FsWriteFile(dir, &fid, (char *) target, len, &ret); + + if (err) { + FsRemoveFile(dir, &fid); + + if (err == FFS_FULL) + err = -ENOSPC; + else + err = -EIO; + goto out; + } + + dir->i_version++; + dir->i_ctime = dir->i_mtime = dir->i_atime = ts; + if (IS_DIRSYNC(dir)) + (void) exfat_sync_inode(dir); + else + mark_inode_dirty(dir); + + i_pos = ((loff_t) fid.dir.dir << 32) | (fid.entry & 0xffffffff); + + inode = exfat_build_inode(sb, &fid, i_pos); + if (IS_ERR(inode)) { + err = PTR_ERR(inode); + goto out; + } + inode->i_version++; + inode->i_mtime = inode->i_atime = inode->i_ctime = ts; + /* timestamp is already written, so mark_inode_dirty() is unneeded. */ + + EXFAT_I(inode)->target = kmalloc(len+1, GFP_KERNEL); + if (!EXFAT_I(inode)->target) { + err = -ENOMEM; + goto out; + } + memcpy(EXFAT_I(inode)->target, target, len+1); + + dentry->d_time = dentry->d_parent->d_inode->i_version; + d_instantiate(dentry, inode); + +out: + __unlock_super(sb); + DPRINTK("exfat_symlink exited\n"); + return err; +} + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,3,0) +static int exfat_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) +#else +static int exfat_mkdir(struct inode *dir, struct dentry *dentry, int mode) +#endif +{ + struct super_block *sb = dir->i_sb; + struct inode *inode; + struct timespec ts; + FILE_ID_T fid; + loff_t i_pos; + int err; + + __lock_super(sb); + + DPRINTK("exfat_mkdir entered\n"); + + ts = CURRENT_TIME_SEC; + + err = FsCreateDir(dir, (u8 *) dentry->d_name.name, &fid); + if (err) { + if (err == FFS_INVALIDPATH) + err = -EINVAL; + else if (err == FFS_FILEEXIST) + err = -EEXIST; + else if (err == FFS_FULL) + err = -ENOSPC; + else if (err == FFS_NAMETOOLONG) + err = -ENAMETOOLONG; + else + err = -EIO; + goto out; + } + dir->i_version++; + dir->i_ctime = dir->i_mtime = dir->i_atime = ts; + if (IS_DIRSYNC(dir)) + (void) exfat_sync_inode(dir); + else + mark_inode_dirty(dir); + inc_nlink(dir); + + i_pos = ((loff_t) fid.dir.dir << 32) | (fid.entry & 0xffffffff); + + inode = exfat_build_inode(sb, &fid, i_pos); + if (IS_ERR(inode)) { + err = PTR_ERR(inode); + goto out; + } + inode->i_version++; + inode->i_mtime = inode->i_atime = inode->i_ctime = ts; + /* timestamp is already written, so mark_inode_dirty() is unneeded. */ + + dentry->d_time = dentry->d_parent->d_inode->i_version; + d_instantiate(dentry, inode); + +out: + __unlock_super(sb); + DPRINTK("exfat_mkdir exited\n"); + return err; +} + +static int exfat_rmdir(struct inode *dir, struct dentry *dentry) +{ + struct inode *inode = dentry->d_inode; + struct super_block *sb = dir->i_sb; + struct timespec ts; + int err; + + __lock_super(sb); + + DPRINTK("exfat_rmdir entered\n"); + + ts = CURRENT_TIME_SEC; + + EXFAT_I(inode)->fid.size = i_size_read(inode); + + err = FsRemoveDir(dir, &(EXFAT_I(inode)->fid)); + if (err) { + if (err == FFS_INVALIDPATH) + err = -EINVAL; + else if (err == FFS_FILEEXIST) + err = -ENOTEMPTY; + else if (err == FFS_NOTFOUND) + err = -ENOENT; + else if (err == FFS_DIRBUSY) + err = -EBUSY; + else + err = -EIO; + goto out; + } + dir->i_version++; + dir->i_mtime = dir->i_atime = ts; + if (IS_DIRSYNC(dir)) + (void) exfat_sync_inode(dir); + else + mark_inode_dirty(dir); + drop_nlink(dir); + + clear_nlink(inode); + inode->i_mtime = inode->i_atime = ts; + exfat_detach(inode); + remove_inode_hash(inode); + +out: + __unlock_super(sb); + DPRINTK("exfat_rmdir exited\n"); + return err; +} + +static int exfat_rename(struct inode *old_dir, struct dentry *old_dentry, + struct inode *new_dir, struct dentry *new_dentry) +{ + struct inode *old_inode, *new_inode; + struct super_block *sb = old_dir->i_sb; + struct timespec ts; + loff_t i_pos; + int err; + + __lock_super(sb); + + DPRINTK("exfat_rename entered\n"); + + old_inode = old_dentry->d_inode; + new_inode = new_dentry->d_inode; + + ts = CURRENT_TIME_SEC; + + EXFAT_I(old_inode)->fid.size = i_size_read(old_inode); + + err = FsMoveFile(old_dir, &(EXFAT_I(old_inode)->fid), new_dir, new_dentry); + if (err) { + if (err == FFS_PERMISSIONERR) + err = -EPERM; + else if (err == FFS_INVALIDPATH) + err = -EINVAL; + else if (err == FFS_FILEEXIST) + err = -EEXIST; + else if (err == FFS_NOTFOUND) + err = -ENOENT; + else if (err == FFS_FULL) + err = -ENOSPC; + else + err = -EIO; + goto out; + } + new_dir->i_version++; + new_dir->i_ctime = new_dir->i_mtime = new_dir->i_atime = ts; + if (IS_DIRSYNC(new_dir)) + (void) exfat_sync_inode(new_dir); + else + mark_inode_dirty(new_dir); + + i_pos = ((loff_t) EXFAT_I(old_inode)->fid.dir.dir << 32) | + (EXFAT_I(old_inode)->fid.entry & 0xffffffff); + + exfat_detach(old_inode); + exfat_attach(old_inode, i_pos); + if (IS_DIRSYNC(new_dir)) + (void) exfat_sync_inode(old_inode); + else + mark_inode_dirty(old_inode); + + if ((S_ISDIR(old_inode->i_mode)) && (old_dir != new_dir)) { + drop_nlink(old_dir); + if (!new_inode) + inc_nlink(new_dir); + } + + old_dir->i_version++; + old_dir->i_ctime = old_dir->i_mtime = ts; + if (IS_DIRSYNC(old_dir)) + (void) exfat_sync_inode(old_dir); + else + mark_inode_dirty(old_dir); + + if (new_inode) { + exfat_detach(new_inode); + drop_nlink(new_inode); + if (S_ISDIR(new_inode->i_mode)) + drop_nlink(new_inode); + new_inode->i_ctime = ts; + } + +out: + __unlock_super(sb); + DPRINTK("exfat_rename exited\n"); + return err; +} + +static int exfat_cont_expand(struct inode *inode, loff_t size) +{ + struct address_space *mapping = inode->i_mapping; + loff_t start = i_size_read(inode), count = size - i_size_read(inode); + int err, err2; + + err = generic_cont_expand_simple(inode, size); + if (err != 0) + return err; + + inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC; + mark_inode_dirty(inode); + + if (IS_SYNC(inode)) { + err = filemap_fdatawrite_range(mapping, start, start + count - 1); + err2 = sync_mapping_buffers(mapping); + err = (err) ? (err) : (err2); + err2 = write_inode_now(inode, 1); + err = (err) ? (err) : (err2); + if (!err) +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,32) + err = wait_on_page_writeback_range(mapping, + start >> PAGE_CACHE_SHIFT, + (start + count - 1) >> PAGE_CACHE_SHIFT); +#else + err = filemap_fdatawait_range(mapping, start, start + count - 1); +#endif + } + return err; +} + +static int exfat_allow_set_time(struct exfat_sb_info *sbi, struct inode *inode) +{ + mode_t allow_utime = sbi->options.allow_utime; + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,5,0) + if (!uid_eq(current_fsuid(), inode->i_uid)) +#else + if (current_fsuid() != inode->i_uid) +#endif + { + if (in_group_p(inode->i_gid)) + allow_utime >>= 3; + if (allow_utime & MAY_WRITE) + return 1; + } + + /* use a default check */ + return 0; +} + +static int exfat_sanitize_mode(const struct exfat_sb_info *sbi, + struct inode *inode, umode_t *mode_ptr) +{ + mode_t i_mode, mask, perm; + + i_mode = inode->i_mode; + + if (S_ISREG(i_mode) || S_ISLNK(i_mode)) + mask = sbi->options.fs_fmask; + else + mask = sbi->options.fs_dmask; + + perm = *mode_ptr & ~(S_IFMT | mask); + + /* Of the r and x bits, all (subject to umask) must be present.*/ + if ((perm & (S_IRUGO | S_IXUGO)) != (i_mode & (S_IRUGO|S_IXUGO))) + return -EPERM; + + if (exfat_mode_can_hold_ro(inode)) { + /* Of the w bits, either all (subject to umask) or none must be present. */ + if ((perm & S_IWUGO) && ((perm & S_IWUGO) != (S_IWUGO & ~mask))) + return -EPERM; + } else { + /* If exfat_mode_can_hold_ro(inode) is false, can't change w bits. */ + if ((perm & S_IWUGO) != (S_IWUGO & ~mask)) + return -EPERM; + } + + *mode_ptr &= S_IFMT | perm; + + return 0; +} + +static int exfat_setattr(struct dentry *dentry, struct iattr *attr) +{ + + struct exfat_sb_info *sbi = EXFAT_SB(dentry->d_sb); + struct inode *inode = dentry->d_inode; + unsigned int ia_valid; + int error; +#if LINUX_VERSION_CODE > KERNEL_VERSION(2,6,35) + loff_t old_size; +#endif + + DPRINTK("exfat_setattr entered\n"); + + if ((attr->ia_valid & ATTR_SIZE) + && (attr->ia_size > i_size_read(inode))) { + error = exfat_cont_expand(inode, attr->ia_size); + if (error || attr->ia_valid == ATTR_SIZE) + return error; + attr->ia_valid &= ~ATTR_SIZE; + } + + ia_valid = attr->ia_valid; + + if ((ia_valid & (ATTR_MTIME_SET | ATTR_ATIME_SET | ATTR_TIMES_SET)) + && exfat_allow_set_time(sbi, inode)) { + attr->ia_valid &= ~(ATTR_MTIME_SET | ATTR_ATIME_SET | ATTR_TIMES_SET); + } + + error = inode_change_ok(inode, attr); + attr->ia_valid = ia_valid; + if (error) + return error; + + if (((attr->ia_valid & ATTR_UID) && +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,5,0) + (!uid_eq(attr->ia_uid, sbi->options.fs_uid))) || + ((attr->ia_valid & ATTR_GID) && + (!gid_eq(attr->ia_gid, sbi->options.fs_gid))) || +#else + (attr->ia_uid != sbi->options.fs_uid)) || + ((attr->ia_valid & ATTR_GID) && + (attr->ia_gid != sbi->options.fs_gid)) || +#endif + ((attr->ia_valid & ATTR_MODE) && + (attr->ia_mode & ~(S_IFREG | S_IFLNK | S_IFDIR | S_IRWXUGO)))) { + return -EPERM; + } + + /* + * We don't return -EPERM here. Yes, strange, but this is too + * old behavior. + */ + if (attr->ia_valid & ATTR_MODE) { + if (exfat_sanitize_mode(sbi, inode, &attr->ia_mode) < 0) + attr->ia_valid &= ~ATTR_MODE; + } + + EXFAT_I(inode)->fid.size = i_size_read(inode); + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,36) + if (attr->ia_valid) + error = inode_setattr(inode, attr); +#else + if (attr->ia_valid & ATTR_SIZE) { + old_size = i_size_read(inode); +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,4,00) + down_write(&EXFAT_I(inode)->truncate_lock); + truncate_setsize(inode, attr->ia_size); + _exfat_truncate(inode, old_size); + up_write(&EXFAT_I(inode)->truncate_lock); +#else + truncate_setsize(inode, attr->ia_size); + _exfat_truncate(inode, old_size); +#endif + } + setattr_copy(inode, attr); + mark_inode_dirty(inode); +#endif + + DPRINTK("exfat_setattr exited\n"); + return error; +} + +static int exfat_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat) +{ + struct inode *inode = dentry->d_inode; + + DPRINTK("exfat_getattr entered\n"); + + generic_fillattr(inode, stat); + stat->blksize = EXFAT_SB(inode->i_sb)->fs_info.cluster_size; + + DPRINTK("exfat_getattr exited\n"); + return 0; +} + +const struct inode_operations exfat_dir_inode_operations = { + .create = exfat_create, + .lookup = exfat_lookup, + .unlink = exfat_unlink, + .symlink = exfat_symlink, + .mkdir = exfat_mkdir, + .rmdir = exfat_rmdir, + .rename = exfat_rename, + .setattr = exfat_setattr, + .getattr = exfat_getattr, +}; + +/*======================================================================*/ +/* File Operations */ +/*======================================================================*/ +#if LINUX_VERSION_CODE > KERNEL_VERSION(4,1,0) +static const char *exfat_follow_link(struct dentry *dentry, void **cookie) +{ + struct exfat_inode_info *ei = EXFAT_I(dentry->d_inode); + return *cookie = (char *)(ei->target); +} +#else +static void *exfat_follow_link(struct dentry *dentry, struct nameidata *nd) +{ + struct exfat_inode_info *ei = EXFAT_I(dentry->d_inode); + nd_set_link(nd, (char *)(ei->target)); + return NULL; +} +#endif + +const struct inode_operations exfat_symlink_inode_operations = { + .readlink = generic_readlink, + .follow_link = exfat_follow_link, +}; + +static int exfat_file_release(struct inode *inode, struct file *filp) +{ + struct super_block *sb = inode->i_sb; + + EXFAT_I(inode)->fid.size = i_size_read(inode); + FsSyncVol(sb, 0); + return 0; +} + +const struct file_operations exfat_file_operations = { + .llseek = generic_file_llseek, +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,16,0) + .read = do_sync_read, + .write = do_sync_write, + .aio_read = generic_file_aio_read, + .aio_write = generic_file_aio_write, +#elif LINUX_VERSION_CODE < KERNEL_VERSION(4,1,0) + .read = new_sync_read, + .write = new_sync_write, +#endif +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,16,0) + .read_iter = generic_file_read_iter, + .write_iter = generic_file_write_iter, +#endif + .mmap = generic_file_mmap, + .release = exfat_file_release, +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,36) + .ioctl = exfat_generic_ioctl, + .fsync = exfat_file_fsync, +#else + .unlocked_ioctl = exfat_generic_ioctl, + .fsync = generic_file_fsync, +#endif + .splice_read = generic_file_splice_read, +}; + +static void _exfat_truncate(struct inode *inode, loff_t old_size) +{ + struct super_block *sb = inode->i_sb; + struct exfat_sb_info *sbi = EXFAT_SB(sb); + FS_INFO_T *p_fs = &(sbi->fs_info); + int err; + + __lock_super(sb); + + /* + * This protects against truncating a file bigger than it was then + * trying to write into the hole. + */ + if (EXFAT_I(inode)->mmu_private > i_size_read(inode)) + EXFAT_I(inode)->mmu_private = i_size_read(inode); + + if (EXFAT_I(inode)->fid.start_clu == 0) + goto out; + + err = FsTruncateFile(inode, old_size, i_size_read(inode)); + if (err) + goto out; + + inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC; + if (IS_DIRSYNC(inode)) + (void) exfat_sync_inode(inode); + else + mark_inode_dirty(inode); + + inode->i_blocks = ((i_size_read(inode) + (p_fs->cluster_size - 1)) + & ~((loff_t)p_fs->cluster_size - 1)) >> 9; +out: + __unlock_super(sb); +} + +#if LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,36) +static void exfat_truncate(struct inode *inode) +{ + _exfat_truncate(inode, i_size_read(inode)); +} +#endif + +const struct inode_operations exfat_file_inode_operations = { +#if LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,36) + .truncate = exfat_truncate, +#endif + .setattr = exfat_setattr, + .getattr = exfat_getattr, +}; + +/*======================================================================*/ +/* Address Space Operations */ +/*======================================================================*/ + +static int exfat_bmap(struct inode *inode, sector_t sector, sector_t *phys, + unsigned long *mapped_blocks, int *create) +{ + struct super_block *sb = inode->i_sb; + struct exfat_sb_info *sbi = EXFAT_SB(sb); + FS_INFO_T *p_fs = &(sbi->fs_info); + BD_INFO_T *p_bd = &(sbi->bd_info); + const unsigned long blocksize = sb->s_blocksize; + const unsigned char blocksize_bits = sb->s_blocksize_bits; + sector_t last_block; + int err, clu_offset, sec_offset; + unsigned int cluster; + + *phys = 0; + *mapped_blocks = 0; + + if ((p_fs->vol_type == FAT12) || (p_fs->vol_type == FAT16)) { + if (inode->i_ino == EXFAT_ROOT_INO) { + if (sector < (p_fs->dentries_in_root >> (p_bd->sector_size_bits-DENTRY_SIZE_BITS))) { + *phys = sector + p_fs->root_start_sector; + *mapped_blocks = 1; + } + return 0; + } + } + + last_block = (i_size_read(inode) + (blocksize - 1)) >> blocksize_bits; + if (sector >= last_block) { + if (*create == 0) + return 0; + } else { + *create = 0; + } + + clu_offset = sector >> p_fs->sectors_per_clu_bits; /* cluster offset */ + sec_offset = sector & (p_fs->sectors_per_clu - 1); /* sector offset in cluster */ + + EXFAT_I(inode)->fid.size = i_size_read(inode); + + err = FsMapCluster(inode, clu_offset, &cluster); + + if (err) { + if (err == FFS_FULL) + return -ENOSPC; + else + return -EIO; + } else if (cluster != CLUSTER_32(~0)) { + *phys = START_SECTOR(cluster) + sec_offset; + *mapped_blocks = p_fs->sectors_per_clu - sec_offset; + } + + return 0; +} + +static int exfat_get_block(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create) +{ + struct super_block *sb = inode->i_sb; + unsigned long max_blocks = bh_result->b_size >> inode->i_blkbits; + int err; + unsigned long mapped_blocks; + sector_t phys; + + __lock_super(sb); + + err = exfat_bmap(inode, iblock, &phys, &mapped_blocks, &create); + if (err) { + __unlock_super(sb); + return err; + } + + if (phys) { + max_blocks = min(mapped_blocks, max_blocks); + if (create) { + EXFAT_I(inode)->mmu_private += max_blocks << sb->s_blocksize_bits; + set_buffer_new(bh_result); + } + map_bh(bh_result, sb, phys); + } + + bh_result->b_size = max_blocks << sb->s_blocksize_bits; + __unlock_super(sb); + + return 0; +} + +static int exfat_readpage(struct file *file, struct page *page) +{ + int ret; + ret = mpage_readpage(page, exfat_get_block); + return ret; +} + +static int exfat_readpages(struct file *file, struct address_space *mapping, + struct list_head *pages, unsigned nr_pages) +{ + int ret; + ret = mpage_readpages(mapping, pages, nr_pages, exfat_get_block); + return ret; +} + +static int exfat_writepage(struct page *page, struct writeback_control *wbc) +{ + int ret; + ret = block_write_full_page(page, exfat_get_block, wbc); + return ret; +} + +static int exfat_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + int ret; + ret = mpage_writepages(mapping, wbc, exfat_get_block); + return ret; +} + +#if LINUX_VERSION_CODE > KERNEL_VERSION(2,6,34) +static void exfat_write_failed(struct address_space *mapping, loff_t to) +{ + struct inode *inode = mapping->host; + if (to > i_size_read(inode)) { +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,12,0) + truncate_pagecache(inode, i_size_read(inode)); +#else + truncate_pagecache(inode, to, i_size_read(inode)); +#endif + EXFAT_I(inode)->fid.size = i_size_read(inode); + _exfat_truncate(inode, i_size_read(inode)); + } +} +#endif + +static int exfat_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) +{ + int ret; + *pagep = NULL; + ret = cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata, + exfat_get_block, + &EXFAT_I(mapping->host)->mmu_private); + +#if LINUX_VERSION_CODE > KERNEL_VERSION(2,6,34) + if (ret < 0) + exfat_write_failed(mapping, pos+len); +#endif + return ret; +} + +static int exfat_write_end(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *pagep, void *fsdata) +{ + struct inode *inode = mapping->host; + FILE_ID_T *fid = &(EXFAT_I(inode)->fid); + int err; + + err = generic_write_end(file, mapping, pos, len, copied, pagep, fsdata); + +#if LINUX_VERSION_CODE > KERNEL_VERSION(2,6,34) + if (err < len) + exfat_write_failed(mapping, pos+len); +#endif + + if (!(err < 0) && !(fid->attr & ATTR_ARCHIVE)) { + inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC; + fid->attr |= ATTR_ARCHIVE; + mark_inode_dirty(inode); + } + return err; +} + +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,16,0) +#ifdef CONFIG_AIO_OPTIMIZATION +static ssize_t exfat_direct_IO(int rw, struct kiocb *iocb, + struct iov_iter *iter, loff_t offset) +#else +static ssize_t exfat_direct_IO(int rw, struct kiocb *iocb, + const struct iovec *iov, + loff_t offset, unsigned long nr_segs) +#endif +#elif LINUX_VERSION_CODE < KERNEL_VERSION(4,2,0) +static ssize_t exfat_direct_IO(int rw, struct kiocb *iocb, + struct iov_iter *iter, loff_t offset) +#else /* >= 4.1.x */ +static ssize_t exfat_direct_IO(struct kiocb *iocb, + struct iov_iter *iter, loff_t offset) +#endif +{ + struct inode *inode = iocb->ki_filp->f_mapping->host; +#if LINUX_VERSION_CODE > KERNEL_VERSION(2,6,34) + struct address_space *mapping = iocb->ki_filp->f_mapping; +#endif + ssize_t ret; +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,2,0) + int rw; + + rw = iov_iter_rw(iter); +#endif + + if (rw == WRITE) { +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,16,0) +#ifdef CONFIG_AIO_OPTIMIZATION + if (EXFAT_I(inode)->mmu_private < + (offset + iov_iter_count(iter))) +#else + if (EXFAT_I(inode)->mmu_private < (offset + iov_length(iov, nr_segs))) +#endif +#else + if (EXFAT_I(inode)->mmu_private < (offset + iov_iter_count(iter))) +#endif + return 0; + } +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,1,0) + ret = blockdev_direct_IO(iocb, inode, iter, + offset, exfat_get_block); +#elif LINUX_VERSION_CODE >= KERNEL_VERSION(3,16,0) + ret = blockdev_direct_IO(rw, iocb, inode, iter, + offset, exfat_get_block); +#elif LINUX_VERSION_CODE >= KERNEL_VERSION(3,1,0) +#ifdef CONFIG_AIO_OPTIMIZATION + ret = blockdev_direct_IO(rw, iocb, inode, iter, + offset, exfat_get_block); +#else + ret = blockdev_direct_IO(rw, iocb, inode, iov, + offset, nr_segs, exfat_get_block); +#endif +#else + ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, + offset, nr_segs, exfat_get_block, NULL); +#endif + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,16,0) + if ((ret < 0) && (rw & WRITE)) + exfat_write_failed(mapping, offset+iov_iter_count(iter)); +#elif LINUX_VERSION_CODE > KERNEL_VERSION(2,6,34) + if ((ret < 0) && (rw & WRITE)) +#ifdef CONFIG_AIO_OPTIMIZATION + exfat_write_failed(mapping, offset+iov_iter_count(iter)); +#else + exfat_write_failed(mapping, offset+iov_length(iov, nr_segs)); +#endif +#endif + return ret; +} + +static sector_t _exfat_bmap(struct address_space *mapping, sector_t block) +{ + sector_t blocknr; + + /* exfat_get_cluster() assumes the requested blocknr isn't truncated. */ +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,4,00) + down_read(&EXFAT_I(mapping->host)->truncate_lock); + blocknr = generic_block_bmap(mapping, block, exfat_get_block); + up_read(&EXFAT_I(mapping->host)->truncate_lock); +#else + down_read(&EXFAT_I(mapping->host)->i_alloc_sem); + blocknr = generic_block_bmap(mapping, block, exfat_get_block); + up_read(&EXFAT_I(mapping->host)->i_alloc_sem); +#endif + + return blocknr; +} + +const struct address_space_operations exfat_aops = { + .readpage = exfat_readpage, + .readpages = exfat_readpages, + .writepage = exfat_writepage, + .writepages = exfat_writepages, +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,39) + .sync_page = block_sync_page, +#endif + .write_begin = exfat_write_begin, + .write_end = exfat_write_end, + .direct_IO = exfat_direct_IO, + .bmap = _exfat_bmap +}; + +/*======================================================================*/ +/* Super Operations */ +/*======================================================================*/ + +static inline unsigned long exfat_hash(loff_t i_pos) +{ + return hash_32(i_pos, EXFAT_HASH_BITS); +} + +static struct inode *exfat_iget(struct super_block *sb, loff_t i_pos) +{ + struct exfat_sb_info *sbi = EXFAT_SB(sb); + struct exfat_inode_info *info; + struct hlist_head *head = sbi->inode_hashtable + exfat_hash(i_pos); + struct inode *inode = NULL; +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,9,0) + struct hlist_node *node; + + spin_lock(&sbi->inode_hash_lock); + hlist_for_each_entry(info, node, head, i_hash_fat) { +#else + spin_lock(&sbi->inode_hash_lock); + hlist_for_each_entry(info, head, i_hash_fat) { +#endif + CHECK_ERR(info->vfs_inode.i_sb != sb); + + if (i_pos != info->i_pos) + continue; + inode = igrab(&info->vfs_inode); + if (inode) + break; + } + spin_unlock(&sbi->inode_hash_lock); + return inode; +} + +static void exfat_attach(struct inode *inode, loff_t i_pos) +{ + struct exfat_sb_info *sbi = EXFAT_SB(inode->i_sb); + struct hlist_head *head = sbi->inode_hashtable + exfat_hash(i_pos); + + spin_lock(&sbi->inode_hash_lock); + EXFAT_I(inode)->i_pos = i_pos; + hlist_add_head(&EXFAT_I(inode)->i_hash_fat, head); + spin_unlock(&sbi->inode_hash_lock); +} + +static void exfat_detach(struct inode *inode) +{ + struct exfat_sb_info *sbi = EXFAT_SB(inode->i_sb); + + spin_lock(&sbi->inode_hash_lock); + hlist_del_init(&EXFAT_I(inode)->i_hash_fat); + EXFAT_I(inode)->i_pos = 0; + spin_unlock(&sbi->inode_hash_lock); +} + +/* doesn't deal with root inode */ +static int exfat_fill_inode(struct inode *inode, FILE_ID_T *fid) +{ + struct exfat_sb_info *sbi = EXFAT_SB(inode->i_sb); + FS_INFO_T *p_fs = &(sbi->fs_info); + DIR_ENTRY_T info; + + memcpy(&(EXFAT_I(inode)->fid), fid, sizeof(FILE_ID_T)); + + FsReadStat(inode, &info); + + EXFAT_I(inode)->i_pos = 0; + EXFAT_I(inode)->target = NULL; + inode->i_uid = sbi->options.fs_uid; + inode->i_gid = sbi->options.fs_gid; + inode->i_version++; + inode->i_generation = get_seconds(); + + if (info.Attr & ATTR_SUBDIR) { /* directory */ + inode->i_generation &= ~1; + inode->i_mode = exfat_make_mode(sbi, info.Attr, S_IRWXUGO); + inode->i_op = &exfat_dir_inode_operations; + inode->i_fop = &exfat_dir_operations; + + i_size_write(inode, info.Size); + EXFAT_I(inode)->mmu_private = i_size_read(inode); +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,2,00) + set_nlink(inode, info.NumSubdirs); +#else + inode->i_nlink = info.NumSubdirs; +#endif + } else if (info.Attr & ATTR_SYMLINK) { /* symbolic link */ + inode->i_generation |= 1; + inode->i_mode = exfat_make_mode(sbi, info.Attr, S_IRWXUGO); + inode->i_op = &exfat_symlink_inode_operations; + + i_size_write(inode, info.Size); + EXFAT_I(inode)->mmu_private = i_size_read(inode); + } else { /* regular file */ + inode->i_generation |= 1; + inode->i_mode = exfat_make_mode(sbi, info.Attr, S_IRWXUGO); + inode->i_op = &exfat_file_inode_operations; + inode->i_fop = &exfat_file_operations; + inode->i_mapping->a_ops = &exfat_aops; + inode->i_mapping->nrpages = 0; + + i_size_write(inode, info.Size); + EXFAT_I(inode)->mmu_private = i_size_read(inode); + } + exfat_save_attr(inode, info.Attr); + + inode->i_blocks = ((i_size_read(inode) + (p_fs->cluster_size - 1)) + & ~((loff_t)p_fs->cluster_size - 1)) >> 9; + + exfat_time_fat2unix(sbi, &inode->i_mtime, &info.ModifyTimestamp); + exfat_time_fat2unix(sbi, &inode->i_ctime, &info.CreateTimestamp); + exfat_time_fat2unix(sbi, &inode->i_atime, &info.AccessTimestamp); + + return 0; +} + +static struct inode *exfat_build_inode(struct super_block *sb, + FILE_ID_T *fid, loff_t i_pos) { + struct inode *inode; + int err; + + inode = exfat_iget(sb, i_pos); + if (inode) + goto out; + inode = new_inode(sb); + if (!inode) { + inode = ERR_PTR(-ENOMEM); + goto out; + } + inode->i_ino = iunique(sb, EXFAT_ROOT_INO); + inode->i_version = 1; + err = exfat_fill_inode(inode, fid); + if (err) { + iput(inode); + inode = ERR_PTR(err); + goto out; + } + exfat_attach(inode, i_pos); + insert_inode_hash(inode); +out: + return inode; +} + +static int exfat_sync_inode(struct inode *inode) +{ +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,34) + return exfat_write_inode(inode, 0); +#else + return exfat_write_inode(inode, NULL); +#endif +} + +static struct inode *exfat_alloc_inode(struct super_block *sb) +{ + struct exfat_inode_info *ei; + + ei = kmem_cache_alloc(exfat_inode_cachep, GFP_NOFS); + if (!ei) + return NULL; + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,4,00) + init_rwsem(&ei->truncate_lock); +#endif + + return &ei->vfs_inode; +} + +static void exfat_destroy_inode(struct inode *inode) +{ + if (EXFAT_I(inode)->target) + kfree(EXFAT_I(inode)->target); + EXFAT_I(inode)->target = NULL; + + kmem_cache_free(exfat_inode_cachep, EXFAT_I(inode)); +} + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,34) +static int exfat_write_inode(struct inode *inode, int wait) +#else +static int exfat_write_inode(struct inode *inode, struct writeback_control *wbc) +#endif +{ + struct super_block *sb = inode->i_sb; + struct exfat_sb_info *sbi = EXFAT_SB(sb); + DIR_ENTRY_T info; + + if (inode->i_ino == EXFAT_ROOT_INO) + return 0; + + info.Attr = exfat_make_attr(inode); + info.Size = i_size_read(inode); + + exfat_time_unix2fat(sbi, &inode->i_mtime, &info.ModifyTimestamp); + exfat_time_unix2fat(sbi, &inode->i_ctime, &info.CreateTimestamp); + exfat_time_unix2fat(sbi, &inode->i_atime, &info.AccessTimestamp); + + FsWriteStat(inode, &info); + + return 0; +} + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,36) +static void exfat_delete_inode(struct inode *inode) +{ + truncate_inode_pages(&inode->i_data, 0); + clear_inode(inode); +} + +static void exfat_clear_inode(struct inode *inode) +{ + exfat_detach(inode); + remove_inode_hash(inode); +} +#else +static void exfat_evict_inode(struct inode *inode) +{ + truncate_inode_pages(&inode->i_data, 0); + + if (!inode->i_nlink) + i_size_write(inode, 0); + invalidate_inode_buffers(inode); +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,4,80) + end_writeback(inode); +#else + clear_inode(inode); +#endif + exfat_detach(inode); + + remove_inode_hash(inode); +} +#endif + +static void exfat_free_super(struct exfat_sb_info *sbi) +{ + if (sbi->nls_disk) + unload_nls(sbi->nls_disk); + if (sbi->nls_io) + unload_nls(sbi->nls_io); + if (sbi->options.iocharset != exfat_default_iocharset) + kfree(sbi->options.iocharset); +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,7,0) + /* mutex_init is in exfat_fill_super function. only for 3.7+ */ + mutex_destroy(&sbi->s_lock); +#endif + kfree(sbi); +} + +static void exfat_put_super(struct super_block *sb) +{ + struct exfat_sb_info *sbi = EXFAT_SB(sb); + if (__is_sb_dirty(sb)) + exfat_write_super(sb); + + FsUmountVol(sb); + + sb->s_fs_info = NULL; + exfat_free_super(sbi); +} + +static void exfat_write_super(struct super_block *sb) +{ + __lock_super(sb); + + __set_sb_clean(sb); + + if (!(sb->s_flags & MS_RDONLY)) + FsSyncVol(sb, 1); + + __unlock_super(sb); +} + +static int exfat_sync_fs(struct super_block *sb, int wait) +{ + int err = 0; + + if (__is_sb_dirty(sb)) { + __lock_super(sb); + __set_sb_clean(sb); + err = FsSyncVol(sb, 1); + __unlock_super(sb); + } + + return err; +} + +static int exfat_statfs(struct dentry *dentry, struct kstatfs *buf) +{ + struct super_block *sb = dentry->d_sb; + u64 id = huge_encode_dev(sb->s_bdev->bd_dev); + FS_INFO_T *p_fs = &(EXFAT_SB(sb)->fs_info); + VOL_INFO_T info; + + if (p_fs->used_clusters == (u32) ~0) { + if (FFS_MEDIAERR == FsGetVolInfo(sb, &info)) + return -EIO; + + } else { + info.FatType = p_fs->vol_type; + info.ClusterSize = p_fs->cluster_size; + info.NumClusters = p_fs->num_clusters - 2; + info.UsedClusters = p_fs->used_clusters; + info.FreeClusters = info.NumClusters - info.UsedClusters; + + if (p_fs->dev_ejected) + printk("[EXFAT] statfs on device is ejected\n"); + } + + buf->f_type = sb->s_magic; + buf->f_bsize = info.ClusterSize; + buf->f_blocks = info.NumClusters; + buf->f_bfree = info.FreeClusters; + buf->f_bavail = info.FreeClusters; + buf->f_fsid.val[0] = (u32)id; + buf->f_fsid.val[1] = (u32)(id >> 32); + buf->f_namelen = 260; + + return 0; +} + +static int exfat_remount(struct super_block *sb, int *flags, char *data) +{ + *flags |= MS_NODIRATIME; + return 0; +} + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,3,0) +static int exfat_show_options(struct seq_file *m, struct dentry *root) +{ + struct exfat_sb_info *sbi = EXFAT_SB(root->d_sb); +#else +static int exfat_show_options(struct seq_file *m, struct vfsmount *mnt) +{ + struct exfat_sb_info *sbi = EXFAT_SB(mnt->mnt_sb); +#endif + struct exfat_mount_options *opts = &sbi->options; +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,5,0) + if (__kuid_val(opts->fs_uid)) + seq_printf(m, ",uid=%u", __kuid_val(opts->fs_uid)); + if (__kgid_val(opts->fs_gid)) + seq_printf(m, ",gid=%u", __kgid_val(opts->fs_gid)); +#else + if (opts->fs_uid != 0) + seq_printf(m, ",uid=%u", opts->fs_uid); + if (opts->fs_gid != 0) + seq_printf(m, ",gid=%u", opts->fs_gid); +#endif + seq_printf(m, ",fmask=%04o", opts->fs_fmask); + seq_printf(m, ",dmask=%04o", opts->fs_dmask); + if (opts->allow_utime) + seq_printf(m, ",allow_utime=%04o", opts->allow_utime); + if (sbi->nls_disk) + seq_printf(m, ",codepage=%s", sbi->nls_disk->charset); + if (sbi->nls_io) + seq_printf(m, ",iocharset=%s", sbi->nls_io->charset); + seq_printf(m, ",namecase=%u", opts->casesensitive); + if (opts->errors == EXFAT_ERRORS_CONT) + seq_puts(m, ",errors=continue"); + else if (opts->errors == EXFAT_ERRORS_PANIC) + seq_puts(m, ",errors=panic"); + else + seq_puts(m, ",errors=remount-ro"); +#ifdef CONFIG_EXFAT_DISCARD + if (opts->discard) + seq_printf(m, ",discard"); +#endif + return 0; +} + +const struct super_operations exfat_sops = { + .alloc_inode = exfat_alloc_inode, + .destroy_inode = exfat_destroy_inode, + .write_inode = exfat_write_inode, +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,36) + .delete_inode = exfat_delete_inode, + .clear_inode = exfat_clear_inode, +#else + .evict_inode = exfat_evict_inode, +#endif + .put_super = exfat_put_super, +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,7,0) + .write_super = exfat_write_super, +#endif + .sync_fs = exfat_sync_fs, + .statfs = exfat_statfs, + .remount_fs = exfat_remount, + .show_options = exfat_show_options, +}; + +/*======================================================================*/ +/* Super Block Read Operations */ +/*======================================================================*/ + +enum { + Opt_uid, + Opt_gid, + Opt_umask, + Opt_dmask, + Opt_fmask, + Opt_allow_utime, + Opt_codepage, + Opt_charset, + Opt_namecase, + Opt_debug, + Opt_err_cont, + Opt_err_panic, + Opt_err_ro, + Opt_utf8_hack, + Opt_err, +#ifdef CONFIG_EXFAT_DISCARD + Opt_discard, +#endif /* EXFAT_CONFIG_DISCARD */ +}; + +static const match_table_t exfat_tokens = { + {Opt_uid, "uid=%u"}, + {Opt_gid, "gid=%u"}, + {Opt_umask, "umask=%o"}, + {Opt_dmask, "dmask=%o"}, + {Opt_fmask, "fmask=%o"}, + {Opt_allow_utime, "allow_utime=%o"}, + {Opt_codepage, "codepage=%u"}, + {Opt_charset, "iocharset=%s"}, + {Opt_namecase, "namecase=%u"}, + {Opt_debug, "debug"}, + {Opt_err_cont, "errors=continue"}, + {Opt_err_panic, "errors=panic"}, + {Opt_err_ro, "errors=remount-ro"}, + {Opt_utf8_hack, "utf8"}, +#ifdef CONFIG_EXFAT_DISCARD + {Opt_discard, "discard"}, +#endif /* CONFIG_EXFAT_DISCARD */ + {Opt_err, NULL} +}; + +static int parse_options(char *options, int silent, int *debug, + struct exfat_mount_options *opts) +{ + char *p; + substring_t args[MAX_OPT_ARGS]; + int option; + char *iocharset; + + opts->fs_uid = current_uid(); + opts->fs_gid = current_gid(); + opts->fs_fmask = opts->fs_dmask = current->fs->umask; + opts->allow_utime = (unsigned short) -1; + opts->codepage = exfat_default_codepage; + opts->iocharset = exfat_default_iocharset; + opts->casesensitive = 0; + opts->errors = EXFAT_ERRORS_RO; +#ifdef CONFIG_EXFAT_DISCARD + opts->discard = 0; +#endif + *debug = 0; + + if (!options) + goto out; + + while ((p = strsep(&options, ",")) != NULL) { + int token; + if (!*p) + continue; + + token = match_token(p, exfat_tokens, args); + switch (token) { + case Opt_uid: + if (match_int(&args[0], &option)) + return 0; +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,5,0) + opts->fs_uid = KUIDT_INIT(option); +#else + opts->fs_uid = option; +#endif + break; + case Opt_gid: + if (match_int(&args[0], &option)) + return 0; +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,5,0) + opts->fs_gid = KGIDT_INIT(option); +#else + opts->fs_gid = option; +#endif + break; + case Opt_umask: + case Opt_dmask: + case Opt_fmask: + if (match_octal(&args[0], &option)) + return 0; + if (token != Opt_dmask) + opts->fs_fmask = option; + if (token != Opt_fmask) + opts->fs_dmask = option; + break; + case Opt_allow_utime: + if (match_octal(&args[0], &option)) + return 0; + opts->allow_utime = option & (S_IWGRP | S_IWOTH); + break; + case Opt_codepage: + if (match_int(&args[0], &option)) + return 0; + opts->codepage = option; + break; + case Opt_charset: + if (opts->iocharset != exfat_default_iocharset) + kfree(opts->iocharset); + iocharset = match_strdup(&args[0]); + if (!iocharset) + return -ENOMEM; + opts->iocharset = iocharset; + break; + case Opt_namecase: + if (match_int(&args[0], &option)) + return 0; + opts->casesensitive = option; + break; + case Opt_err_cont: + opts->errors = EXFAT_ERRORS_CONT; + break; + case Opt_err_panic: + opts->errors = EXFAT_ERRORS_PANIC; + break; + case Opt_err_ro: + opts->errors = EXFAT_ERRORS_RO; + break; + case Opt_debug: + *debug = 1; + break; +#ifdef CONFIG_EXFAT_DISCARD + case Opt_discard: + opts->discard = 1; + break; +#endif /* CONFIG_EXFAT_DISCARD */ + case Opt_utf8_hack: + break; + default: + if (!silent) + printk(KERN_ERR "[EXFAT] Unrecognized mount option %s or missing value\n", p); + return -EINVAL; + } + } + +out: + if (opts->allow_utime == (unsigned short) -1) + opts->allow_utime = ~opts->fs_dmask & (S_IWGRP | S_IWOTH); + + return 0; +} + +static void exfat_hash_init(struct super_block *sb) +{ + struct exfat_sb_info *sbi = EXFAT_SB(sb); + int i; + + spin_lock_init(&sbi->inode_hash_lock); + for (i = 0; i < EXFAT_HASH_SIZE; i++) + INIT_HLIST_HEAD(&sbi->inode_hashtable[i]); +} + +static int exfat_read_root(struct inode *inode) +{ + struct super_block *sb = inode->i_sb; + struct exfat_sb_info *sbi = EXFAT_SB(sb); + struct timespec ts; + FS_INFO_T *p_fs = &(sbi->fs_info); + DIR_ENTRY_T info; + + ts = CURRENT_TIME_SEC; + + EXFAT_I(inode)->fid.dir.dir = p_fs->root_dir; + EXFAT_I(inode)->fid.dir.flags = 0x01; + EXFAT_I(inode)->fid.entry = -1; + EXFAT_I(inode)->fid.start_clu = p_fs->root_dir; + EXFAT_I(inode)->fid.flags = 0x01; + EXFAT_I(inode)->fid.type = TYPE_DIR; + EXFAT_I(inode)->fid.rwoffset = 0; + EXFAT_I(inode)->fid.hint_last_off = -1; + + EXFAT_I(inode)->target = NULL; + + FsReadStat(inode, &info); + + inode->i_uid = sbi->options.fs_uid; + inode->i_gid = sbi->options.fs_gid; + inode->i_version++; + inode->i_generation = 0; + inode->i_mode = exfat_make_mode(sbi, ATTR_SUBDIR, S_IRWXUGO); + inode->i_op = &exfat_dir_inode_operations; + inode->i_fop = &exfat_dir_operations; + + i_size_write(inode, info.Size); + inode->i_blocks = ((i_size_read(inode) + (p_fs->cluster_size - 1)) + & ~((loff_t)p_fs->cluster_size - 1)) >> 9; + EXFAT_I(inode)->i_pos = ((loff_t) p_fs->root_dir << 32) | 0xffffffff; + EXFAT_I(inode)->mmu_private = i_size_read(inode); + + exfat_save_attr(inode, ATTR_SUBDIR); + inode->i_mtime = inode->i_atime = inode->i_ctime = ts; +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,2,00) + set_nlink(inode, info.NumSubdirs + 2); +#else + inode->i_nlink = info.NumSubdirs + 2; +#endif + + return 0; +} + +#if LINUX_VERSION_CODE > KERNEL_VERSION(2,6,37) +static void setup_dops(struct super_block *sb) +{ + if (EXFAT_SB(sb)->options.casesensitive == 0) + sb->s_d_op = &exfat_ci_dentry_ops; + else + sb->s_d_op = &exfat_dentry_ops; +} +#endif + +static int exfat_fill_super(struct super_block *sb, void *data, int silent) +{ + struct inode *root_inode = NULL; + struct exfat_sb_info *sbi; + int debug, ret; + long error; + char buf[50]; + + /* + * GFP_KERNEL is ok here, because while we do hold the + * supeblock lock, memory pressure can't call back into + * the filesystem, since we're only just about to mount + * it and have no inodes etc active! + */ + sbi = kzalloc(sizeof(struct exfat_sb_info), GFP_KERNEL); + if (!sbi) + return -ENOMEM; +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,7,0) + mutex_init(&sbi->s_lock); +#endif + sb->s_fs_info = sbi; + sb->s_flags |= MS_NODIRATIME; + sb->s_magic = EXFAT_SUPER_MAGIC; + sb->s_op = &exfat_sops; + + error = parse_options(data, silent, &debug, &sbi->options); + if (error) + goto out_fail; + +#if LINUX_VERSION_CODE > KERNEL_VERSION(2,6,37) + setup_dops(sb); +#endif + + error = -EIO; + sb_min_blocksize(sb, 512); + sb->s_maxbytes = 0x7fffffffffffffffLL; /* maximum file size */ + + ret = FsMountVol(sb); + if (ret) { + if (!silent) + printk(KERN_ERR "[EXFAT] FsMountVol failed\n"); + + goto out_fail; + } + + /* set up enough so that it can read an inode */ + exfat_hash_init(sb); + + /* + * The low byte of FAT's first entry must have same value with + * media-field. But in real world, too many devices is + * writing wrong value. So, removed that validity check. + * + * if (FAT_FIRST_ENT(sb, media) != first) + */ + + /* codepage is not meaningful in exfat */ + if (sbi->fs_info.vol_type != EXFAT) { + error = -EINVAL; + sprintf(buf, "cp%d", sbi->options.codepage); + sbi->nls_disk = load_nls(buf); + if (!sbi->nls_disk) { + printk(KERN_ERR "[EXFAT] Codepage %s not found\n", buf); + goto out_fail2; + } + } + + sbi->nls_io = load_nls(sbi->options.iocharset); + + error = -ENOMEM; + root_inode = new_inode(sb); + if (!root_inode) + goto out_fail2; + root_inode->i_ino = EXFAT_ROOT_INO; + root_inode->i_version = 1; + error = exfat_read_root(root_inode); + if (error < 0) + goto out_fail2; + error = -ENOMEM; + exfat_attach(root_inode, EXFAT_I(root_inode)->i_pos); + insert_inode_hash(root_inode); +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,4,00) + sb->s_root = d_make_root(root_inode); +#else + sb->s_root = d_alloc_root(root_inode); +#endif + if (!sb->s_root) { + printk(KERN_ERR "[EXFAT] Getting the root inode failed\n"); + goto out_fail2; + } + + return 0; + +out_fail2: + FsUmountVol(sb); +out_fail: + if (root_inode) + iput(root_inode); + sb->s_fs_info = NULL; + exfat_free_super(sbi); + return error; +} +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,37) +static int exfat_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, + void *data, struct vfsmount *mnt) +{ + return get_sb_bdev(fs_type, flags, dev_name, data, exfat_fill_super, mnt); +} +#else +static struct dentry *exfat_fs_mount(struct file_system_type *fs_type, + int flags, const char *dev_name, + void *data) { + return mount_bdev(fs_type, flags, dev_name, data, exfat_fill_super); +} +#endif + +static void init_once(void *foo) +{ + struct exfat_inode_info *ei = (struct exfat_inode_info *)foo; + + INIT_HLIST_NODE(&ei->i_hash_fat); + inode_init_once(&ei->vfs_inode); +} + +static int __init exfat_init_inodecache(void) +{ + exfat_inode_cachep = kmem_cache_create("exfat_inode_cache", + sizeof(struct exfat_inode_info), + 0, (SLAB_RECLAIM_ACCOUNT| + SLAB_MEM_SPREAD), + init_once); + if (exfat_inode_cachep == NULL) + return -ENOMEM; + return 0; +} + +static void __exit exfat_destroy_inodecache(void) +{ + kmem_cache_destroy(exfat_inode_cachep); +} + +#ifdef CONFIG_EXFAT_KERNEL_DEBUG +static void exfat_debug_kill_sb(struct super_block *sb) +{ + struct exfat_sb_info *sbi = EXFAT_SB(sb); + struct block_device *bdev = sb->s_bdev; + + long flags; + + if (sbi) { + flags = sbi->debug_flags; + + if (flags & EXFAT_DEBUGFLAGS_INVALID_UMOUNT) { + /* invalidate_bdev drops all device cache include dirty. + we use this to simulate device removal */ + FsReleaseCache(sb); + invalidate_bdev(bdev); + } + } + + kill_block_super(sb); +} +#endif /* CONFIG_EXFAT_KERNEL_DEBUG */ + +static struct file_system_type exfat_fs_type = { + .owner = THIS_MODULE, +#if defined(CONFIG_MACH_LGE) || defined(CONFIG_HTC_BATT_CORE) + .name = "texfat", +#else + .name = "exfat", +#endif +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,37) + .get_sb = exfat_get_sb, +#else + .mount = exfat_fs_mount, +#endif +#ifdef CONFIG_EXFAT_KERNEL_DEBUG + .kill_sb = exfat_debug_kill_sb, +#else + .kill_sb = kill_block_super, +#endif /* CONFIG_EXFAT_KERNEL_DEBUG */ + .fs_flags = FS_REQUIRES_DEV, +}; + +static int __init init_exfat(void) +{ + int err; + + err = FsInit(); + if (err) { + if (err == FFS_MEMORYERR) + return -ENOMEM; + else + return -EIO; + } + + printk(KERN_INFO "exFAT: Version %s\n", EXFAT_VERSION); + + err = exfat_init_inodecache(); + if (err) + goto out; + + err = register_filesystem(&exfat_fs_type); + if (err) + goto out; + + return 0; +out: + FsShutdown(); + return err; +} + +static void __exit exit_exfat(void) +{ + exfat_destroy_inodecache(); + unregister_filesystem(&exfat_fs_type); + FsShutdown(); +} + +module_init(init_exfat); +module_exit(exit_exfat); + +MODULE_LICENSE("GPL"); +MODULE_DESCRIPTION("exFAT Filesystem Driver"); +#ifdef MODULE_ALIAS_FS +#if defined(CONFIG_MACH_LGE) || defined(CONFIG_HTC_BATT_CORE) +MODULE_ALIAS_FS("texfat"); +#else +MODULE_ALIAS_FS("exfat"); +#endif +#endif diff --git b/fs/exfat/exfat_super.h b/fs/exfat/exfat_super.h new file mode 100644 index 0000000..916811e --- /dev/null +++ b/fs/exfat/exfat_super.h @@ -0,0 +1,171 @@ +/* Some of the source code in this file came from "linux/fs/fat/fat.h". */ + +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +#ifndef _EXFAT_LINUX_H +#define _EXFAT_LINUX_H + +#include +#include +#include +#include +#include +#include + +#include "exfat_config.h" +#include "exfat_data.h" +#include "exfat_oal.h" + +#include "exfat_blkdev.h" +#include "exfat_cache.h" +#include "exfat_nls.h" +#include "exfat_api.h" +#include "exfat_core.h" + +#define EXFAT_ERRORS_CONT 1 /* ignore error and continue */ +#define EXFAT_ERRORS_PANIC 2 /* panic on error */ +#define EXFAT_ERRORS_RO 3 /* remount r/o on error */ + +/* ioctl command */ +#define EXFAT_IOCTL_GET_VOLUME_ID _IOR('r', 0x12, __u32) + +struct exfat_mount_options { +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,5,0) + kuid_t fs_uid; + kgid_t fs_gid; +#else + uid_t fs_uid; + gid_t fs_gid; +#endif + unsigned short fs_fmask; + unsigned short fs_dmask; + unsigned short allow_utime; /* permission for setting the [am]time */ + unsigned short codepage; /* codepage for shortname conversions */ + char *iocharset; /* charset for filename input/display */ + unsigned char casesensitive; + unsigned char errors; /* on error: continue, panic, remount-ro */ +#ifdef CONFIG_EXFAT_DISCARD + unsigned char discard; /* flag on if -o dicard specified and device support discard() */ +#endif /* CONFIG_EXFAT_DISCARD */ +}; + +#define EXFAT_HASH_BITS 8 +#define EXFAT_HASH_SIZE (1UL << EXFAT_HASH_BITS) + +/* + * EXFAT file system in-core superblock data + */ +struct exfat_sb_info { + FS_INFO_T fs_info; + BD_INFO_T bd_info; + + struct exfat_mount_options options; + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,7,00) + int s_dirt; + struct mutex s_lock; +#endif + struct nls_table *nls_disk; /* Codepage used on disk */ + struct nls_table *nls_io; /* Charset used for input and display */ + + struct inode *fat_inode; + + spinlock_t inode_hash_lock; + struct hlist_head inode_hashtable[EXFAT_HASH_SIZE]; +#ifdef CONFIG_EXFAT_KERNEL_DEBUG + long debug_flags; +#endif /* CONFIG_EXFAT_KERNEL_DEBUG */ +}; + +/* + * EXFAT file system inode data in memory + */ +struct exfat_inode_info { + FILE_ID_T fid; + char *target; + /* NOTE: mmu_private is 64bits, so must hold ->i_mutex to access */ + loff_t mmu_private; /* physically allocated size */ + loff_t i_pos; /* on-disk position of directory entry or 0 */ + struct hlist_node i_hash_fat; /* hash by i_location */ +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,4,00) + struct rw_semaphore truncate_lock; +#endif + struct inode vfs_inode; + struct rw_semaphore i_alloc_sem; /* protect bmap against truncate */ +}; + +#define EXFAT_SB(sb) ((struct exfat_sb_info *)((sb)->s_fs_info)) + +static inline struct exfat_inode_info *EXFAT_I(struct inode *inode) +{ + return container_of(inode, struct exfat_inode_info, vfs_inode); +} + +/* + * If ->i_mode can't hold S_IWUGO (i.e. ATTR_RO), we use ->i_attrs to + * save ATTR_RO instead of ->i_mode. + * + * If it's directory and !sbi->options.rodir, ATTR_RO isn't read-only + * bit, it's just used as flag for app. + */ +static inline int exfat_mode_can_hold_ro(struct inode *inode) +{ + struct exfat_sb_info *sbi = EXFAT_SB(inode->i_sb); + + if (S_ISDIR(inode->i_mode)) + return 0; + + if ((~sbi->options.fs_fmask) & S_IWUGO) + return 1; + return 0; +} + +/* Convert attribute bits and a mask to the UNIX mode. */ +static inline mode_t exfat_make_mode(struct exfat_sb_info *sbi, + u32 attr, mode_t mode) +{ + if ((attr & ATTR_READONLY) && !(attr & ATTR_SUBDIR)) + mode &= ~S_IWUGO; + + if (attr & ATTR_SUBDIR) + return (mode & ~sbi->options.fs_dmask) | S_IFDIR; + else if (attr & ATTR_SYMLINK) + return (mode & ~sbi->options.fs_dmask) | S_IFLNK; + else + return (mode & ~sbi->options.fs_fmask) | S_IFREG; +} + +/* Return the FAT attribute byte for this inode */ +static inline u32 exfat_make_attr(struct inode *inode) +{ + if (exfat_mode_can_hold_ro(inode) && !(inode->i_mode & S_IWUGO)) + return (EXFAT_I(inode)->fid.attr) | ATTR_READONLY; + else + return EXFAT_I(inode)->fid.attr; +} + +static inline void exfat_save_attr(struct inode *inode, u32 attr) +{ + if (exfat_mode_can_hold_ro(inode)) + EXFAT_I(inode)->fid.attr = attr & ATTR_RWMASK; + else + EXFAT_I(inode)->fid.attr = attr & (ATTR_RWMASK | ATTR_READONLY); +} + +#endif /* _EXFAT_LINUX_H */ diff --git b/fs/exfat/exfat_upcase.c b/fs/exfat/exfat_upcase.c new file mode 100644 index 0000000..3807f37 --- /dev/null +++ b/fs/exfat/exfat_upcase.c @@ -0,0 +1,405 @@ +/* + * Copyright (C) 2012-2013 Samsung Electronics Co., Ltd. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_upcase.c */ +/* PURPOSE : exFAT Up-case Table */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY (Ver 0.9) */ +/* */ +/* - 2010.11.15 [Joosun Hahn] : first writing */ +/* */ +/************************************************************************/ + +#include "exfat_config.h" + +#include "exfat_nls.h" + +const u8 uni_upcase[NUM_UPCASE<<1] = { + 0x00, 0x00, 0x01, 0x00, 0x02, 0x00, 0x03, 0x00, 0x04, 0x00, 0x05, 0x00, 0x06, 0x00, 0x07, 0x00, + 0x08, 0x00, 0x09, 0x00, 0x0A, 0x00, 0x0B, 0x00, 0x0C, 0x00, 0x0D, 0x00, 0x0E, 0x00, 0x0F, 0x00, + 0x10, 0x00, 0x11, 0x00, 0x12, 0x00, 0x13, 0x00, 0x14, 0x00, 0x15, 0x00, 0x16, 0x00, 0x17, 0x00, + 0x18, 0x00, 0x19, 0x00, 0x1A, 0x00, 0x1B, 0x00, 0x1C, 0x00, 0x1D, 0x00, 0x1E, 0x00, 0x1F, 0x00, + 0x20, 0x00, 0x21, 0x00, 0x22, 0x00, 0x23, 0x00, 0x24, 0x00, 0x25, 0x00, 0x26, 0x00, 0x27, 0x00, + 0x28, 0x00, 0x29, 0x00, 0x2A, 0x00, 0x2B, 0x00, 0x2C, 0x00, 0x2D, 0x00, 0x2E, 0x00, 0x2F, 0x00, + 0x30, 0x00, 0x31, 0x00, 0x32, 0x00, 0x33, 0x00, 0x34, 0x00, 0x35, 0x00, 0x36, 0x00, 0x37, 0x00, + 0x38, 0x00, 0x39, 0x00, 0x3A, 0x00, 0x3B, 0x00, 0x3C, 0x00, 0x3D, 0x00, 0x3E, 0x00, 0x3F, 0x00, + 0x40, 0x00, 0x41, 0x00, 0x42, 0x00, 0x43, 0x00, 0x44, 0x00, 0x45, 0x00, 0x46, 0x00, 0x47, 0x00, + 0x48, 0x00, 0x49, 0x00, 0x4A, 0x00, 0x4B, 0x00, 0x4C, 0x00, 0x4D, 0x00, 0x4E, 0x00, 0x4F, 0x00, + 0x50, 0x00, 0x51, 0x00, 0x52, 0x00, 0x53, 0x00, 0x54, 0x00, 0x55, 0x00, 0x56, 0x00, 0x57, 0x00, + 0x58, 0x00, 0x59, 0x00, 0x5A, 0x00, 0x5B, 0x00, 0x5C, 0x00, 0x5D, 0x00, 0x5E, 0x00, 0x5F, 0x00, + 0x60, 0x00, 0x41, 0x00, 0x42, 0x00, 0x43, 0x00, 0x44, 0x00, 0x45, 0x00, 0x46, 0x00, 0x47, 0x00, + 0x48, 0x00, 0x49, 0x00, 0x4A, 0x00, 0x4B, 0x00, 0x4C, 0x00, 0x4D, 0x00, 0x4E, 0x00, 0x4F, 0x00, + 0x50, 0x00, 0x51, 0x00, 0x52, 0x00, 0x53, 0x00, 0x54, 0x00, 0x55, 0x00, 0x56, 0x00, 0x57, 0x00, + 0x58, 0x00, 0x59, 0x00, 0x5A, 0x00, 0x7B, 0x00, 0x7C, 0x00, 0x7D, 0x00, 0x7E, 0x00, 0x7F, 0x00, + 0x80, 0x00, 0x81, 0x00, 0x82, 0x00, 0x83, 0x00, 0x84, 0x00, 0x85, 0x00, 0x86, 0x00, 0x87, 0x00, + 0x88, 0x00, 0x89, 0x00, 0x8A, 0x00, 0x8B, 0x00, 0x8C, 0x00, 0x8D, 0x00, 0x8E, 0x00, 0x8F, 0x00, + 0x90, 0x00, 0x91, 0x00, 0x92, 0x00, 0x93, 0x00, 0x94, 0x00, 0x95, 0x00, 0x96, 0x00, 0x97, 0x00, + 0x98, 0x00, 0x99, 0x00, 0x9A, 0x00, 0x9B, 0x00, 0x9C, 0x00, 0x9D, 0x00, 0x9E, 0x00, 0x9F, 0x00, + 0xA0, 0x00, 0xA1, 0x00, 0xA2, 0x00, 0xA3, 0x00, 0xA4, 0x00, 0xA5, 0x00, 0xA6, 0x00, 0xA7, 0x00, + 0xA8, 0x00, 0xA9, 0x00, 0xAA, 0x00, 0xAB, 0x00, 0xAC, 0x00, 0xAD, 0x00, 0xAE, 0x00, 0xAF, 0x00, + 0xB0, 0x00, 0xB1, 0x00, 0xB2, 0x00, 0xB3, 0x00, 0xB4, 0x00, 0xB5, 0x00, 0xB6, 0x00, 0xB7, 0x00, + 0xB8, 0x00, 0xB9, 0x00, 0xBA, 0x00, 0xBB, 0x00, 0xBC, 0x00, 0xBD, 0x00, 0xBE, 0x00, 0xBF, 0x00, + 0xC0, 0x00, 0xC1, 0x00, 0xC2, 0x00, 0xC3, 0x00, 0xC4, 0x00, 0xC5, 0x00, 0xC6, 0x00, 0xC7, 0x00, + 0xC8, 0x00, 0xC9, 0x00, 0xCA, 0x00, 0xCB, 0x00, 0xCC, 0x00, 0xCD, 0x00, 0xCE, 0x00, 0xCF, 0x00, + 0xD0, 0x00, 0xD1, 0x00, 0xD2, 0x00, 0xD3, 0x00, 0xD4, 0x00, 0xD5, 0x00, 0xD6, 0x00, 0xD7, 0x00, + 0xD8, 0x00, 0xD9, 0x00, 0xDA, 0x00, 0xDB, 0x00, 0xDC, 0x00, 0xDD, 0x00, 0xDE, 0x00, 0xDF, 0x00, + 0xC0, 0x00, 0xC1, 0x00, 0xC2, 0x00, 0xC3, 0x00, 0xC4, 0x00, 0xC5, 0x00, 0xC6, 0x00, 0xC7, 0x00, + 0xC8, 0x00, 0xC9, 0x00, 0xCA, 0x00, 0xCB, 0x00, 0xCC, 0x00, 0xCD, 0x00, 0xCE, 0x00, 0xCF, 0x00, + 0xD0, 0x00, 0xD1, 0x00, 0xD2, 0x00, 0xD3, 0x00, 0xD4, 0x00, 0xD5, 0x00, 0xD6, 0x00, 0xF7, 0x00, + 0xD8, 0x00, 0xD9, 0x00, 0xDA, 0x00, 0xDB, 0x00, 0xDC, 0x00, 0xDD, 0x00, 0xDE, 0x00, 0x78, 0x01, + 0x00, 0x01, 0x00, 0x01, 0x02, 0x01, 0x02, 0x01, 0x04, 0x01, 0x04, 0x01, 0x06, 0x01, 0x06, 0x01, + 0x08, 0x01, 0x08, 0x01, 0x0A, 0x01, 0x0A, 0x01, 0x0C, 0x01, 0x0C, 0x01, 0x0E, 0x01, 0x0E, 0x01, + 0x10, 0x01, 0x10, 0x01, 0x12, 0x01, 0x12, 0x01, 0x14, 0x01, 0x14, 0x01, 0x16, 0x01, 0x16, 0x01, + 0x18, 0x01, 0x18, 0x01, 0x1A, 0x01, 0x1A, 0x01, 0x1C, 0x01, 0x1C, 0x01, 0x1E, 0x01, 0x1E, 0x01, + 0x20, 0x01, 0x20, 0x01, 0x22, 0x01, 0x22, 0x01, 0x24, 0x01, 0x24, 0x01, 0x26, 0x01, 0x26, 0x01, + 0x28, 0x01, 0x28, 0x01, 0x2A, 0x01, 0x2A, 0x01, 0x2C, 0x01, 0x2C, 0x01, 0x2E, 0x01, 0x2E, 0x01, + 0x30, 0x01, 0x31, 0x01, 0x32, 0x01, 0x32, 0x01, 0x34, 0x01, 0x34, 0x01, 0x36, 0x01, 0x36, 0x01, + 0x38, 0x01, 0x39, 0x01, 0x39, 0x01, 0x3B, 0x01, 0x3B, 0x01, 0x3D, 0x01, 0x3D, 0x01, 0x3F, 0x01, + 0x3F, 0x01, 0x41, 0x01, 0x41, 0x01, 0x43, 0x01, 0x43, 0x01, 0x45, 0x01, 0x45, 0x01, 0x47, 0x01, + 0x47, 0x01, 0x49, 0x01, 0x4A, 0x01, 0x4A, 0x01, 0x4C, 0x01, 0x4C, 0x01, 0x4E, 0x01, 0x4E, 0x01, + 0x50, 0x01, 0x50, 0x01, 0x52, 0x01, 0x52, 0x01, 0x54, 0x01, 0x54, 0x01, 0x56, 0x01, 0x56, 0x01, + 0x58, 0x01, 0x58, 0x01, 0x5A, 0x01, 0x5A, 0x01, 0x5C, 0x01, 0x5C, 0x01, 0x5E, 0x01, 0x5E, 0x01, + 0x60, 0x01, 0x60, 0x01, 0x62, 0x01, 0x62, 0x01, 0x64, 0x01, 0x64, 0x01, 0x66, 0x01, 0x66, 0x01, + 0x68, 0x01, 0x68, 0x01, 0x6A, 0x01, 0x6A, 0x01, 0x6C, 0x01, 0x6C, 0x01, 0x6E, 0x01, 0x6E, 0x01, + 0x70, 0x01, 0x70, 0x01, 0x72, 0x01, 0x72, 0x01, 0x74, 0x01, 0x74, 0x01, 0x76, 0x01, 0x76, 0x01, + 0x78, 0x01, 0x79, 0x01, 0x79, 0x01, 0x7B, 0x01, 0x7B, 0x01, 0x7D, 0x01, 0x7D, 0x01, 0x7F, 0x01, + 0x43, 0x02, 0x81, 0x01, 0x82, 0x01, 0x82, 0x01, 0x84, 0x01, 0x84, 0x01, 0x86, 0x01, 0x87, 0x01, + 0x87, 0x01, 0x89, 0x01, 0x8A, 0x01, 0x8B, 0x01, 0x8B, 0x01, 0x8D, 0x01, 0x8E, 0x01, 0x8F, 0x01, + 0x90, 0x01, 0x91, 0x01, 0x91, 0x01, 0x93, 0x01, 0x94, 0x01, 0xF6, 0x01, 0x96, 0x01, 0x97, 0x01, + 0x98, 0x01, 0x98, 0x01, 0x3D, 0x02, 0x9B, 0x01, 0x9C, 0x01, 0x9D, 0x01, 0x20, 0x02, 0x9F, 0x01, + 0xA0, 0x01, 0xA0, 0x01, 0xA2, 0x01, 0xA2, 0x01, 0xA4, 0x01, 0xA4, 0x01, 0xA6, 0x01, 0xA7, 0x01, + 0xA7, 0x01, 0xA9, 0x01, 0xAA, 0x01, 0xAB, 0x01, 0xAC, 0x01, 0xAC, 0x01, 0xAE, 0x01, 0xAF, 0x01, + 0xAF, 0x01, 0xB1, 0x01, 0xB2, 0x01, 0xB3, 0x01, 0xB3, 0x01, 0xB5, 0x01, 0xB5, 0x01, 0xB7, 0x01, + 0xB8, 0x01, 0xB8, 0x01, 0xBA, 0x01, 0xBB, 0x01, 0xBC, 0x01, 0xBC, 0x01, 0xBE, 0x01, 0xF7, 0x01, + 0xC0, 0x01, 0xC1, 0x01, 0xC2, 0x01, 0xC3, 0x01, 0xC4, 0x01, 0xC5, 0x01, 0xC4, 0x01, 0xC7, 0x01, + 0xC8, 0x01, 0xC7, 0x01, 0xCA, 0x01, 0xCB, 0x01, 0xCA, 0x01, 0xCD, 0x01, 0xCD, 0x01, 0xCF, 0x01, + 0xCF, 0x01, 0xD1, 0x01, 0xD1, 0x01, 0xD3, 0x01, 0xD3, 0x01, 0xD5, 0x01, 0xD5, 0x01, 0xD7, 0x01, + 0xD7, 0x01, 0xD9, 0x01, 0xD9, 0x01, 0xDB, 0x01, 0xDB, 0x01, 0x8E, 0x01, 0xDE, 0x01, 0xDE, 0x01, + 0xE0, 0x01, 0xE0, 0x01, 0xE2, 0x01, 0xE2, 0x01, 0xE4, 0x01, 0xE4, 0x01, 0xE6, 0x01, 0xE6, 0x01, + 0xE8, 0x01, 0xE8, 0x01, 0xEA, 0x01, 0xEA, 0x01, 0xEC, 0x01, 0xEC, 0x01, 0xEE, 0x01, 0xEE, 0x01, + 0xF0, 0x01, 0xF1, 0x01, 0xF2, 0x01, 0xF1, 0x01, 0xF4, 0x01, 0xF4, 0x01, 0xF6, 0x01, 0xF7, 0x01, + 0xF8, 0x01, 0xF8, 0x01, 0xFA, 0x01, 0xFA, 0x01, 0xFC, 0x01, 0xFC, 0x01, 0xFE, 0x01, 0xFE, 0x01, + 0x00, 0x02, 0x00, 0x02, 0x02, 0x02, 0x02, 0x02, 0x04, 0x02, 0x04, 0x02, 0x06, 0x02, 0x06, 0x02, + 0x08, 0x02, 0x08, 0x02, 0x0A, 0x02, 0x0A, 0x02, 0x0C, 0x02, 0x0C, 0x02, 0x0E, 0x02, 0x0E, 0x02, + 0x10, 0x02, 0x10, 0x02, 0x12, 0x02, 0x12, 0x02, 0x14, 0x02, 0x14, 0x02, 0x16, 0x02, 0x16, 0x02, + 0x18, 0x02, 0x18, 0x02, 0x1A, 0x02, 0x1A, 0x02, 0x1C, 0x02, 0x1C, 0x02, 0x1E, 0x02, 0x1E, 0x02, + 0x20, 0x02, 0x21, 0x02, 0x22, 0x02, 0x22, 0x02, 0x24, 0x02, 0x24, 0x02, 0x26, 0x02, 0x26, 0x02, + 0x28, 0x02, 0x28, 0x02, 0x2A, 0x02, 0x2A, 0x02, 0x2C, 0x02, 0x2C, 0x02, 0x2E, 0x02, 0x2E, 0x02, + 0x30, 0x02, 0x30, 0x02, 0x32, 0x02, 0x32, 0x02, 0x34, 0x02, 0x35, 0x02, 0x36, 0x02, 0x37, 0x02, + 0x38, 0x02, 0x39, 0x02, 0x65, 0x2C, 0x3B, 0x02, 0x3B, 0x02, 0x3D, 0x02, 0x66, 0x2C, 0x3F, 0x02, + 0x40, 0x02, 0x41, 0x02, 0x41, 0x02, 0x43, 0x02, 0x44, 0x02, 0x45, 0x02, 0x46, 0x02, 0x46, 0x02, + 0x48, 0x02, 0x48, 0x02, 0x4A, 0x02, 0x4A, 0x02, 0x4C, 0x02, 0x4C, 0x02, 0x4E, 0x02, 0x4E, 0x02, + 0x50, 0x02, 0x51, 0x02, 0x52, 0x02, 0x81, 0x01, 0x86, 0x01, 0x55, 0x02, 0x89, 0x01, 0x8A, 0x01, + 0x58, 0x02, 0x8F, 0x01, 0x5A, 0x02, 0x90, 0x01, 0x5C, 0x02, 0x5D, 0x02, 0x5E, 0x02, 0x5F, 0x02, + 0x93, 0x01, 0x61, 0x02, 0x62, 0x02, 0x94, 0x01, 0x64, 0x02, 0x65, 0x02, 0x66, 0x02, 0x67, 0x02, + 0x97, 0x01, 0x96, 0x01, 0x6A, 0x02, 0x62, 0x2C, 0x6C, 0x02, 0x6D, 0x02, 0x6E, 0x02, 0x9C, 0x01, + 0x70, 0x02, 0x71, 0x02, 0x9D, 0x01, 0x73, 0x02, 0x74, 0x02, 0x9F, 0x01, 0x76, 0x02, 0x77, 0x02, + 0x78, 0x02, 0x79, 0x02, 0x7A, 0x02, 0x7B, 0x02, 0x7C, 0x02, 0x64, 0x2C, 0x7E, 0x02, 0x7F, 0x02, + 0xA6, 0x01, 0x81, 0x02, 0x82, 0x02, 0xA9, 0x01, 0x84, 0x02, 0x85, 0x02, 0x86, 0x02, 0x87, 0x02, + 0xAE, 0x01, 0x44, 0x02, 0xB1, 0x01, 0xB2, 0x01, 0x45, 0x02, 0x8D, 0x02, 0x8E, 0x02, 0x8F, 0x02, + 0x90, 0x02, 0x91, 0x02, 0xB7, 0x01, 0x93, 0x02, 0x94, 0x02, 0x95, 0x02, 0x96, 0x02, 0x97, 0x02, + 0x98, 0x02, 0x99, 0x02, 0x9A, 0x02, 0x9B, 0x02, 0x9C, 0x02, 0x9D, 0x02, 0x9E, 0x02, 0x9F, 0x02, + 0xA0, 0x02, 0xA1, 0x02, 0xA2, 0x02, 0xA3, 0x02, 0xA4, 0x02, 0xA5, 0x02, 0xA6, 0x02, 0xA7, 0x02, + 0xA8, 0x02, 0xA9, 0x02, 0xAA, 0x02, 0xAB, 0x02, 0xAC, 0x02, 0xAD, 0x02, 0xAE, 0x02, 0xAF, 0x02, + 0xB0, 0x02, 0xB1, 0x02, 0xB2, 0x02, 0xB3, 0x02, 0xB4, 0x02, 0xB5, 0x02, 0xB6, 0x02, 0xB7, 0x02, + 0xB8, 0x02, 0xB9, 0x02, 0xBA, 0x02, 0xBB, 0x02, 0xBC, 0x02, 0xBD, 0x02, 0xBE, 0x02, 0xBF, 0x02, + 0xC0, 0x02, 0xC1, 0x02, 0xC2, 0x02, 0xC3, 0x02, 0xC4, 0x02, 0xC5, 0x02, 0xC6, 0x02, 0xC7, 0x02, + 0xC8, 0x02, 0xC9, 0x02, 0xCA, 0x02, 0xCB, 0x02, 0xCC, 0x02, 0xCD, 0x02, 0xCE, 0x02, 0xCF, 0x02, + 0xD0, 0x02, 0xD1, 0x02, 0xD2, 0x02, 0xD3, 0x02, 0xD4, 0x02, 0xD5, 0x02, 0xD6, 0x02, 0xD7, 0x02, + 0xD8, 0x02, 0xD9, 0x02, 0xDA, 0x02, 0xDB, 0x02, 0xDC, 0x02, 0xDD, 0x02, 0xDE, 0x02, 0xDF, 0x02, + 0xE0, 0x02, 0xE1, 0x02, 0xE2, 0x02, 0xE3, 0x02, 0xE4, 0x02, 0xE5, 0x02, 0xE6, 0x02, 0xE7, 0x02, + 0xE8, 0x02, 0xE9, 0x02, 0xEA, 0x02, 0xEB, 0x02, 0xEC, 0x02, 0xED, 0x02, 0xEE, 0x02, 0xEF, 0x02, + 0xF0, 0x02, 0xF1, 0x02, 0xF2, 0x02, 0xF3, 0x02, 0xF4, 0x02, 0xF5, 0x02, 0xF6, 0x02, 0xF7, 0x02, + 0xF8, 0x02, 0xF9, 0x02, 0xFA, 0x02, 0xFB, 0x02, 0xFC, 0x02, 0xFD, 0x02, 0xFE, 0x02, 0xFF, 0x02, + 0x00, 0x03, 0x01, 0x03, 0x02, 0x03, 0x03, 0x03, 0x04, 0x03, 0x05, 0x03, 0x06, 0x03, 0x07, 0x03, + 0x08, 0x03, 0x09, 0x03, 0x0A, 0x03, 0x0B, 0x03, 0x0C, 0x03, 0x0D, 0x03, 0x0E, 0x03, 0x0F, 0x03, + 0x10, 0x03, 0x11, 0x03, 0x12, 0x03, 0x13, 0x03, 0x14, 0x03, 0x15, 0x03, 0x16, 0x03, 0x17, 0x03, + 0x18, 0x03, 0x19, 0x03, 0x1A, 0x03, 0x1B, 0x03, 0x1C, 0x03, 0x1D, 0x03, 0x1E, 0x03, 0x1F, 0x03, + 0x20, 0x03, 0x21, 0x03, 0x22, 0x03, 0x23, 0x03, 0x24, 0x03, 0x25, 0x03, 0x26, 0x03, 0x27, 0x03, + 0x28, 0x03, 0x29, 0x03, 0x2A, 0x03, 0x2B, 0x03, 0x2C, 0x03, 0x2D, 0x03, 0x2E, 0x03, 0x2F, 0x03, + 0x30, 0x03, 0x31, 0x03, 0x32, 0x03, 0x33, 0x03, 0x34, 0x03, 0x35, 0x03, 0x36, 0x03, 0x37, 0x03, + 0x38, 0x03, 0x39, 0x03, 0x3A, 0x03, 0x3B, 0x03, 0x3C, 0x03, 0x3D, 0x03, 0x3E, 0x03, 0x3F, 0x03, + 0x40, 0x03, 0x41, 0x03, 0x42, 0x03, 0x43, 0x03, 0x44, 0x03, 0x45, 0x03, 0x46, 0x03, 0x47, 0x03, + 0x48, 0x03, 0x49, 0x03, 0x4A, 0x03, 0x4B, 0x03, 0x4C, 0x03, 0x4D, 0x03, 0x4E, 0x03, 0x4F, 0x03, + 0x50, 0x03, 0x51, 0x03, 0x52, 0x03, 0x53, 0x03, 0x54, 0x03, 0x55, 0x03, 0x56, 0x03, 0x57, 0x03, + 0x58, 0x03, 0x59, 0x03, 0x5A, 0x03, 0x5B, 0x03, 0x5C, 0x03, 0x5D, 0x03, 0x5E, 0x03, 0x5F, 0x03, + 0x60, 0x03, 0x61, 0x03, 0x62, 0x03, 0x63, 0x03, 0x64, 0x03, 0x65, 0x03, 0x66, 0x03, 0x67, 0x03, + 0x68, 0x03, 0x69, 0x03, 0x6A, 0x03, 0x6B, 0x03, 0x6C, 0x03, 0x6D, 0x03, 0x6E, 0x03, 0x6F, 0x03, + 0x70, 0x03, 0x71, 0x03, 0x72, 0x03, 0x73, 0x03, 0x74, 0x03, 0x75, 0x03, 0x76, 0x03, 0x77, 0x03, + 0x78, 0x03, 0x79, 0x03, 0x7A, 0x03, 0xFD, 0x03, 0xFE, 0x03, 0xFF, 0x03, 0x7E, 0x03, 0x7F, 0x03, + 0x80, 0x03, 0x81, 0x03, 0x82, 0x03, 0x83, 0x03, 0x84, 0x03, 0x85, 0x03, 0x86, 0x03, 0x87, 0x03, + 0x88, 0x03, 0x89, 0x03, 0x8A, 0x03, 0x8B, 0x03, 0x8C, 0x03, 0x8D, 0x03, 0x8E, 0x03, 0x8F, 0x03, + 0x90, 0x03, 0x91, 0x03, 0x92, 0x03, 0x93, 0x03, 0x94, 0x03, 0x95, 0x03, 0x96, 0x03, 0x97, 0x03, + 0x98, 0x03, 0x99, 0x03, 0x9A, 0x03, 0x9B, 0x03, 0x9C, 0x03, 0x9D, 0x03, 0x9E, 0x03, 0x9F, 0x03, + 0xA0, 0x03, 0xA1, 0x03, 0xA2, 0x03, 0xA3, 0x03, 0xA4, 0x03, 0xA5, 0x03, 0xA6, 0x03, 0xA7, 0x03, + 0xA8, 0x03, 0xA9, 0x03, 0xAA, 0x03, 0xAB, 0x03, 0x86, 0x03, 0x88, 0x03, 0x89, 0x03, 0x8A, 0x03, + 0xB0, 0x03, 0x91, 0x03, 0x92, 0x03, 0x93, 0x03, 0x94, 0x03, 0x95, 0x03, 0x96, 0x03, 0x97, 0x03, + 0x98, 0x03, 0x99, 0x03, 0x9A, 0x03, 0x9B, 0x03, 0x9C, 0x03, 0x9D, 0x03, 0x9E, 0x03, 0x9F, 0x03, + 0xA0, 0x03, 0xA1, 0x03, 0xA3, 0x03, 0xA3, 0x03, 0xA4, 0x03, 0xA5, 0x03, 0xA6, 0x03, 0xA7, 0x03, + 0xA8, 0x03, 0xA9, 0x03, 0xAA, 0x03, 0xAB, 0x03, 0x8C, 0x03, 0x8E, 0x03, 0x8F, 0x03, 0xCF, 0x03, + 0xD0, 0x03, 0xD1, 0x03, 0xD2, 0x03, 0xD3, 0x03, 0xD4, 0x03, 0xD5, 0x03, 0xD6, 0x03, 0xD7, 0x03, + 0xD8, 0x03, 0xD8, 0x03, 0xDA, 0x03, 0xDA, 0x03, 0xDC, 0x03, 0xDC, 0x03, 0xDE, 0x03, 0xDE, 0x03, + 0xE0, 0x03, 0xE0, 0x03, 0xE2, 0x03, 0xE2, 0x03, 0xE4, 0x03, 0xE4, 0x03, 0xE6, 0x03, 0xE6, 0x03, + 0xE8, 0x03, 0xE8, 0x03, 0xEA, 0x03, 0xEA, 0x03, 0xEC, 0x03, 0xEC, 0x03, 0xEE, 0x03, 0xEE, 0x03, + 0xF0, 0x03, 0xF1, 0x03, 0xF9, 0x03, 0xF3, 0x03, 0xF4, 0x03, 0xF5, 0x03, 0xF6, 0x03, 0xF7, 0x03, + 0xF7, 0x03, 0xF9, 0x03, 0xFA, 0x03, 0xFA, 0x03, 0xFC, 0x03, 0xFD, 0x03, 0xFE, 0x03, 0xFF, 0x03, + 0x00, 0x04, 0x01, 0x04, 0x02, 0x04, 0x03, 0x04, 0x04, 0x04, 0x05, 0x04, 0x06, 0x04, 0x07, 0x04, + 0x08, 0x04, 0x09, 0x04, 0x0A, 0x04, 0x0B, 0x04, 0x0C, 0x04, 0x0D, 0x04, 0x0E, 0x04, 0x0F, 0x04, + 0x10, 0x04, 0x11, 0x04, 0x12, 0x04, 0x13, 0x04, 0x14, 0x04, 0x15, 0x04, 0x16, 0x04, 0x17, 0x04, + 0x18, 0x04, 0x19, 0x04, 0x1A, 0x04, 0x1B, 0x04, 0x1C, 0x04, 0x1D, 0x04, 0x1E, 0x04, 0x1F, 0x04, + 0x20, 0x04, 0x21, 0x04, 0x22, 0x04, 0x23, 0x04, 0x24, 0x04, 0x25, 0x04, 0x26, 0x04, 0x27, 0x04, + 0x28, 0x04, 0x29, 0x04, 0x2A, 0x04, 0x2B, 0x04, 0x2C, 0x04, 0x2D, 0x04, 0x2E, 0x04, 0x2F, 0x04, + 0x10, 0x04, 0x11, 0x04, 0x12, 0x04, 0x13, 0x04, 0x14, 0x04, 0x15, 0x04, 0x16, 0x04, 0x17, 0x04, + 0x18, 0x04, 0x19, 0x04, 0x1A, 0x04, 0x1B, 0x04, 0x1C, 0x04, 0x1D, 0x04, 0x1E, 0x04, 0x1F, 0x04, + 0x20, 0x04, 0x21, 0x04, 0x22, 0x04, 0x23, 0x04, 0x24, 0x04, 0x25, 0x04, 0x26, 0x04, 0x27, 0x04, + 0x28, 0x04, 0x29, 0x04, 0x2A, 0x04, 0x2B, 0x04, 0x2C, 0x04, 0x2D, 0x04, 0x2E, 0x04, 0x2F, 0x04, + 0x00, 0x04, 0x01, 0x04, 0x02, 0x04, 0x03, 0x04, 0x04, 0x04, 0x05, 0x04, 0x06, 0x04, 0x07, 0x04, + 0x08, 0x04, 0x09, 0x04, 0x0A, 0x04, 0x0B, 0x04, 0x0C, 0x04, 0x0D, 0x04, 0x0E, 0x04, 0x0F, 0x04, + 0x60, 0x04, 0x60, 0x04, 0x62, 0x04, 0x62, 0x04, 0x64, 0x04, 0x64, 0x04, 0x66, 0x04, 0x66, 0x04, + 0x68, 0x04, 0x68, 0x04, 0x6A, 0x04, 0x6A, 0x04, 0x6C, 0x04, 0x6C, 0x04, 0x6E, 0x04, 0x6E, 0x04, + 0x70, 0x04, 0x70, 0x04, 0x72, 0x04, 0x72, 0x04, 0x74, 0x04, 0x74, 0x04, 0x76, 0x04, 0x76, 0x04, + 0x78, 0x04, 0x78, 0x04, 0x7A, 0x04, 0x7A, 0x04, 0x7C, 0x04, 0x7C, 0x04, 0x7E, 0x04, 0x7E, 0x04, + 0x80, 0x04, 0x80, 0x04, 0x82, 0x04, 0x83, 0x04, 0x84, 0x04, 0x85, 0x04, 0x86, 0x04, 0x87, 0x04, + 0x88, 0x04, 0x89, 0x04, 0x8A, 0x04, 0x8A, 0x04, 0x8C, 0x04, 0x8C, 0x04, 0x8E, 0x04, 0x8E, 0x04, + 0x90, 0x04, 0x90, 0x04, 0x92, 0x04, 0x92, 0x04, 0x94, 0x04, 0x94, 0x04, 0x96, 0x04, 0x96, 0x04, + 0x98, 0x04, 0x98, 0x04, 0x9A, 0x04, 0x9A, 0x04, 0x9C, 0x04, 0x9C, 0x04, 0x9E, 0x04, 0x9E, 0x04, + 0xA0, 0x04, 0xA0, 0x04, 0xA2, 0x04, 0xA2, 0x04, 0xA4, 0x04, 0xA4, 0x04, 0xA6, 0x04, 0xA6, 0x04, + 0xA8, 0x04, 0xA8, 0x04, 0xAA, 0x04, 0xAA, 0x04, 0xAC, 0x04, 0xAC, 0x04, 0xAE, 0x04, 0xAE, 0x04, + 0xB0, 0x04, 0xB0, 0x04, 0xB2, 0x04, 0xB2, 0x04, 0xB4, 0x04, 0xB4, 0x04, 0xB6, 0x04, 0xB6, 0x04, + 0xB8, 0x04, 0xB8, 0x04, 0xBA, 0x04, 0xBA, 0x04, 0xBC, 0x04, 0xBC, 0x04, 0xBE, 0x04, 0xBE, 0x04, + 0xC0, 0x04, 0xC1, 0x04, 0xC1, 0x04, 0xC3, 0x04, 0xC3, 0x04, 0xC5, 0x04, 0xC5, 0x04, 0xC7, 0x04, + 0xC7, 0x04, 0xC9, 0x04, 0xC9, 0x04, 0xCB, 0x04, 0xCB, 0x04, 0xCD, 0x04, 0xCD, 0x04, 0xC0, 0x04, + 0xD0, 0x04, 0xD0, 0x04, 0xD2, 0x04, 0xD2, 0x04, 0xD4, 0x04, 0xD4, 0x04, 0xD6, 0x04, 0xD6, 0x04, + 0xD8, 0x04, 0xD8, 0x04, 0xDA, 0x04, 0xDA, 0x04, 0xDC, 0x04, 0xDC, 0x04, 0xDE, 0x04, 0xDE, 0x04, + 0xE0, 0x04, 0xE0, 0x04, 0xE2, 0x04, 0xE2, 0x04, 0xE4, 0x04, 0xE4, 0x04, 0xE6, 0x04, 0xE6, 0x04, + 0xE8, 0x04, 0xE8, 0x04, 0xEA, 0x04, 0xEA, 0x04, 0xEC, 0x04, 0xEC, 0x04, 0xEE, 0x04, 0xEE, 0x04, + 0xF0, 0x04, 0xF0, 0x04, 0xF2, 0x04, 0xF2, 0x04, 0xF4, 0x04, 0xF4, 0x04, 0xF6, 0x04, 0xF6, 0x04, + 0xF8, 0x04, 0xF8, 0x04, 0xFA, 0x04, 0xFA, 0x04, 0xFC, 0x04, 0xFC, 0x04, 0xFE, 0x04, 0xFE, 0x04, + 0x00, 0x05, 0x00, 0x05, 0x02, 0x05, 0x02, 0x05, 0x04, 0x05, 0x04, 0x05, 0x06, 0x05, 0x06, 0x05, + 0x08, 0x05, 0x08, 0x05, 0x0A, 0x05, 0x0A, 0x05, 0x0C, 0x05, 0x0C, 0x05, 0x0E, 0x05, 0x0E, 0x05, + 0x10, 0x05, 0x10, 0x05, 0x12, 0x05, 0x12, 0x05, 0x14, 0x05, 0x15, 0x05, 0x16, 0x05, 0x17, 0x05, + 0x18, 0x05, 0x19, 0x05, 0x1A, 0x05, 0x1B, 0x05, 0x1C, 0x05, 0x1D, 0x05, 0x1E, 0x05, 0x1F, 0x05, + 0x20, 0x05, 0x21, 0x05, 0x22, 0x05, 0x23, 0x05, 0x24, 0x05, 0x25, 0x05, 0x26, 0x05, 0x27, 0x05, + 0x28, 0x05, 0x29, 0x05, 0x2A, 0x05, 0x2B, 0x05, 0x2C, 0x05, 0x2D, 0x05, 0x2E, 0x05, 0x2F, 0x05, + 0x30, 0x05, 0x31, 0x05, 0x32, 0x05, 0x33, 0x05, 0x34, 0x05, 0x35, 0x05, 0x36, 0x05, 0x37, 0x05, + 0x38, 0x05, 0x39, 0x05, 0x3A, 0x05, 0x3B, 0x05, 0x3C, 0x05, 0x3D, 0x05, 0x3E, 0x05, 0x3F, 0x05, + 0x40, 0x05, 0x41, 0x05, 0x42, 0x05, 0x43, 0x05, 0x44, 0x05, 0x45, 0x05, 0x46, 0x05, 0x47, 0x05, + 0x48, 0x05, 0x49, 0x05, 0x4A, 0x05, 0x4B, 0x05, 0x4C, 0x05, 0x4D, 0x05, 0x4E, 0x05, 0x4F, 0x05, + 0x50, 0x05, 0x51, 0x05, 0x52, 0x05, 0x53, 0x05, 0x54, 0x05, 0x55, 0x05, 0x56, 0x05, 0x57, 0x05, + 0x58, 0x05, 0x59, 0x05, 0x5A, 0x05, 0x5B, 0x05, 0x5C, 0x05, 0x5D, 0x05, 0x5E, 0x05, 0x5F, 0x05, + 0x60, 0x05, 0x31, 0x05, 0x32, 0x05, 0x33, 0x05, 0x34, 0x05, 0x35, 0x05, 0x36, 0x05, 0x37, 0x05, + 0x38, 0x05, 0x39, 0x05, 0x3A, 0x05, 0x3B, 0x05, 0x3C, 0x05, 0x3D, 0x05, 0x3E, 0x05, 0x3F, 0x05, + 0x40, 0x05, 0x41, 0x05, 0x42, 0x05, 0x43, 0x05, 0x44, 0x05, 0x45, 0x05, 0x46, 0x05, 0x47, 0x05, + 0x48, 0x05, 0x49, 0x05, 0x4A, 0x05, 0x4B, 0x05, 0x4C, 0x05, 0x4D, 0x05, 0x4E, 0x05, 0x4F, 0x05, + 0x50, 0x05, 0x51, 0x05, 0x52, 0x05, 0x53, 0x05, 0x54, 0x05, 0x55, 0x05, 0x56, 0x05, 0xFF, 0xFF, + 0xF6, 0x17, 0x63, 0x2C, 0x7E, 0x1D, 0x7F, 0x1D, 0x80, 0x1D, 0x81, 0x1D, 0x82, 0x1D, 0x83, 0x1D, + 0x84, 0x1D, 0x85, 0x1D, 0x86, 0x1D, 0x87, 0x1D, 0x88, 0x1D, 0x89, 0x1D, 0x8A, 0x1D, 0x8B, 0x1D, + 0x8C, 0x1D, 0x8D, 0x1D, 0x8E, 0x1D, 0x8F, 0x1D, 0x90, 0x1D, 0x91, 0x1D, 0x92, 0x1D, 0x93, 0x1D, + 0x94, 0x1D, 0x95, 0x1D, 0x96, 0x1D, 0x97, 0x1D, 0x98, 0x1D, 0x99, 0x1D, 0x9A, 0x1D, 0x9B, 0x1D, + 0x9C, 0x1D, 0x9D, 0x1D, 0x9E, 0x1D, 0x9F, 0x1D, 0xA0, 0x1D, 0xA1, 0x1D, 0xA2, 0x1D, 0xA3, 0x1D, + 0xA4, 0x1D, 0xA5, 0x1D, 0xA6, 0x1D, 0xA7, 0x1D, 0xA8, 0x1D, 0xA9, 0x1D, 0xAA, 0x1D, 0xAB, 0x1D, + 0xAC, 0x1D, 0xAD, 0x1D, 0xAE, 0x1D, 0xAF, 0x1D, 0xB0, 0x1D, 0xB1, 0x1D, 0xB2, 0x1D, 0xB3, 0x1D, + 0xB4, 0x1D, 0xB5, 0x1D, 0xB6, 0x1D, 0xB7, 0x1D, 0xB8, 0x1D, 0xB9, 0x1D, 0xBA, 0x1D, 0xBB, 0x1D, + 0xBC, 0x1D, 0xBD, 0x1D, 0xBE, 0x1D, 0xBF, 0x1D, 0xC0, 0x1D, 0xC1, 0x1D, 0xC2, 0x1D, 0xC3, 0x1D, + 0xC4, 0x1D, 0xC5, 0x1D, 0xC6, 0x1D, 0xC7, 0x1D, 0xC8, 0x1D, 0xC9, 0x1D, 0xCA, 0x1D, 0xCB, 0x1D, + 0xCC, 0x1D, 0xCD, 0x1D, 0xCE, 0x1D, 0xCF, 0x1D, 0xD0, 0x1D, 0xD1, 0x1D, 0xD2, 0x1D, 0xD3, 0x1D, + 0xD4, 0x1D, 0xD5, 0x1D, 0xD6, 0x1D, 0xD7, 0x1D, 0xD8, 0x1D, 0xD9, 0x1D, 0xDA, 0x1D, 0xDB, 0x1D, + 0xDC, 0x1D, 0xDD, 0x1D, 0xDE, 0x1D, 0xDF, 0x1D, 0xE0, 0x1D, 0xE1, 0x1D, 0xE2, 0x1D, 0xE3, 0x1D, + 0xE4, 0x1D, 0xE5, 0x1D, 0xE6, 0x1D, 0xE7, 0x1D, 0xE8, 0x1D, 0xE9, 0x1D, 0xEA, 0x1D, 0xEB, 0x1D, + 0xEC, 0x1D, 0xED, 0x1D, 0xEE, 0x1D, 0xEF, 0x1D, 0xF0, 0x1D, 0xF1, 0x1D, 0xF2, 0x1D, 0xF3, 0x1D, + 0xF4, 0x1D, 0xF5, 0x1D, 0xF6, 0x1D, 0xF7, 0x1D, 0xF8, 0x1D, 0xF9, 0x1D, 0xFA, 0x1D, 0xFB, 0x1D, + 0xFC, 0x1D, 0xFD, 0x1D, 0xFE, 0x1D, 0xFF, 0x1D, 0x00, 0x1E, 0x00, 0x1E, 0x02, 0x1E, 0x02, 0x1E, + 0x04, 0x1E, 0x04, 0x1E, 0x06, 0x1E, 0x06, 0x1E, 0x08, 0x1E, 0x08, 0x1E, 0x0A, 0x1E, 0x0A, 0x1E, + 0x0C, 0x1E, 0x0C, 0x1E, 0x0E, 0x1E, 0x0E, 0x1E, 0x10, 0x1E, 0x10, 0x1E, 0x12, 0x1E, 0x12, 0x1E, + 0x14, 0x1E, 0x14, 0x1E, 0x16, 0x1E, 0x16, 0x1E, 0x18, 0x1E, 0x18, 0x1E, 0x1A, 0x1E, 0x1A, 0x1E, + 0x1C, 0x1E, 0x1C, 0x1E, 0x1E, 0x1E, 0x1E, 0x1E, 0x20, 0x1E, 0x20, 0x1E, 0x22, 0x1E, 0x22, 0x1E, + 0x24, 0x1E, 0x24, 0x1E, 0x26, 0x1E, 0x26, 0x1E, 0x28, 0x1E, 0x28, 0x1E, 0x2A, 0x1E, 0x2A, 0x1E, + 0x2C, 0x1E, 0x2C, 0x1E, 0x2E, 0x1E, 0x2E, 0x1E, 0x30, 0x1E, 0x30, 0x1E, 0x32, 0x1E, 0x32, 0x1E, + 0x34, 0x1E, 0x34, 0x1E, 0x36, 0x1E, 0x36, 0x1E, 0x38, 0x1E, 0x38, 0x1E, 0x3A, 0x1E, 0x3A, 0x1E, + 0x3C, 0x1E, 0x3C, 0x1E, 0x3E, 0x1E, 0x3E, 0x1E, 0x40, 0x1E, 0x40, 0x1E, 0x42, 0x1E, 0x42, 0x1E, + 0x44, 0x1E, 0x44, 0x1E, 0x46, 0x1E, 0x46, 0x1E, 0x48, 0x1E, 0x48, 0x1E, 0x4A, 0x1E, 0x4A, 0x1E, + 0x4C, 0x1E, 0x4C, 0x1E, 0x4E, 0x1E, 0x4E, 0x1E, 0x50, 0x1E, 0x50, 0x1E, 0x52, 0x1E, 0x52, 0x1E, + 0x54, 0x1E, 0x54, 0x1E, 0x56, 0x1E, 0x56, 0x1E, 0x58, 0x1E, 0x58, 0x1E, 0x5A, 0x1E, 0x5A, 0x1E, + 0x5C, 0x1E, 0x5C, 0x1E, 0x5E, 0x1E, 0x5E, 0x1E, 0x60, 0x1E, 0x60, 0x1E, 0x62, 0x1E, 0x62, 0x1E, + 0x64, 0x1E, 0x64, 0x1E, 0x66, 0x1E, 0x66, 0x1E, 0x68, 0x1E, 0x68, 0x1E, 0x6A, 0x1E, 0x6A, 0x1E, + 0x6C, 0x1E, 0x6C, 0x1E, 0x6E, 0x1E, 0x6E, 0x1E, 0x70, 0x1E, 0x70, 0x1E, 0x72, 0x1E, 0x72, 0x1E, + 0x74, 0x1E, 0x74, 0x1E, 0x76, 0x1E, 0x76, 0x1E, 0x78, 0x1E, 0x78, 0x1E, 0x7A, 0x1E, 0x7A, 0x1E, + 0x7C, 0x1E, 0x7C, 0x1E, 0x7E, 0x1E, 0x7E, 0x1E, 0x80, 0x1E, 0x80, 0x1E, 0x82, 0x1E, 0x82, 0x1E, + 0x84, 0x1E, 0x84, 0x1E, 0x86, 0x1E, 0x86, 0x1E, 0x88, 0x1E, 0x88, 0x1E, 0x8A, 0x1E, 0x8A, 0x1E, + 0x8C, 0x1E, 0x8C, 0x1E, 0x8E, 0x1E, 0x8E, 0x1E, 0x90, 0x1E, 0x90, 0x1E, 0x92, 0x1E, 0x92, 0x1E, + 0x94, 0x1E, 0x94, 0x1E, 0x96, 0x1E, 0x97, 0x1E, 0x98, 0x1E, 0x99, 0x1E, 0x9A, 0x1E, 0x9B, 0x1E, + 0x9C, 0x1E, 0x9D, 0x1E, 0x9E, 0x1E, 0x9F, 0x1E, 0xA0, 0x1E, 0xA0, 0x1E, 0xA2, 0x1E, 0xA2, 0x1E, + 0xA4, 0x1E, 0xA4, 0x1E, 0xA6, 0x1E, 0xA6, 0x1E, 0xA8, 0x1E, 0xA8, 0x1E, 0xAA, 0x1E, 0xAA, 0x1E, + 0xAC, 0x1E, 0xAC, 0x1E, 0xAE, 0x1E, 0xAE, 0x1E, 0xB0, 0x1E, 0xB0, 0x1E, 0xB2, 0x1E, 0xB2, 0x1E, + 0xB4, 0x1E, 0xB4, 0x1E, 0xB6, 0x1E, 0xB6, 0x1E, 0xB8, 0x1E, 0xB8, 0x1E, 0xBA, 0x1E, 0xBA, 0x1E, + 0xBC, 0x1E, 0xBC, 0x1E, 0xBE, 0x1E, 0xBE, 0x1E, 0xC0, 0x1E, 0xC0, 0x1E, 0xC2, 0x1E, 0xC2, 0x1E, + 0xC4, 0x1E, 0xC4, 0x1E, 0xC6, 0x1E, 0xC6, 0x1E, 0xC8, 0x1E, 0xC8, 0x1E, 0xCA, 0x1E, 0xCA, 0x1E, + 0xCC, 0x1E, 0xCC, 0x1E, 0xCE, 0x1E, 0xCE, 0x1E, 0xD0, 0x1E, 0xD0, 0x1E, 0xD2, 0x1E, 0xD2, 0x1E, + 0xD4, 0x1E, 0xD4, 0x1E, 0xD6, 0x1E, 0xD6, 0x1E, 0xD8, 0x1E, 0xD8, 0x1E, 0xDA, 0x1E, 0xDA, 0x1E, + 0xDC, 0x1E, 0xDC, 0x1E, 0xDE, 0x1E, 0xDE, 0x1E, 0xE0, 0x1E, 0xE0, 0x1E, 0xE2, 0x1E, 0xE2, 0x1E, + 0xE4, 0x1E, 0xE4, 0x1E, 0xE6, 0x1E, 0xE6, 0x1E, 0xE8, 0x1E, 0xE8, 0x1E, 0xEA, 0x1E, 0xEA, 0x1E, + 0xEC, 0x1E, 0xEC, 0x1E, 0xEE, 0x1E, 0xEE, 0x1E, 0xF0, 0x1E, 0xF0, 0x1E, 0xF2, 0x1E, 0xF2, 0x1E, + 0xF4, 0x1E, 0xF4, 0x1E, 0xF6, 0x1E, 0xF6, 0x1E, 0xF8, 0x1E, 0xF8, 0x1E, 0xFA, 0x1E, 0xFB, 0x1E, + 0xFC, 0x1E, 0xFD, 0x1E, 0xFE, 0x1E, 0xFF, 0x1E, 0x08, 0x1F, 0x09, 0x1F, 0x0A, 0x1F, 0x0B, 0x1F, + 0x0C, 0x1F, 0x0D, 0x1F, 0x0E, 0x1F, 0x0F, 0x1F, 0x08, 0x1F, 0x09, 0x1F, 0x0A, 0x1F, 0x0B, 0x1F, + 0x0C, 0x1F, 0x0D, 0x1F, 0x0E, 0x1F, 0x0F, 0x1F, 0x18, 0x1F, 0x19, 0x1F, 0x1A, 0x1F, 0x1B, 0x1F, + 0x1C, 0x1F, 0x1D, 0x1F, 0x16, 0x1F, 0x17, 0x1F, 0x18, 0x1F, 0x19, 0x1F, 0x1A, 0x1F, 0x1B, 0x1F, + 0x1C, 0x1F, 0x1D, 0x1F, 0x1E, 0x1F, 0x1F, 0x1F, 0x28, 0x1F, 0x29, 0x1F, 0x2A, 0x1F, 0x2B, 0x1F, + 0x2C, 0x1F, 0x2D, 0x1F, 0x2E, 0x1F, 0x2F, 0x1F, 0x28, 0x1F, 0x29, 0x1F, 0x2A, 0x1F, 0x2B, 0x1F, + 0x2C, 0x1F, 0x2D, 0x1F, 0x2E, 0x1F, 0x2F, 0x1F, 0x38, 0x1F, 0x39, 0x1F, 0x3A, 0x1F, 0x3B, 0x1F, + 0x3C, 0x1F, 0x3D, 0x1F, 0x3E, 0x1F, 0x3F, 0x1F, 0x38, 0x1F, 0x39, 0x1F, 0x3A, 0x1F, 0x3B, 0x1F, + 0x3C, 0x1F, 0x3D, 0x1F, 0x3E, 0x1F, 0x3F, 0x1F, 0x48, 0x1F, 0x49, 0x1F, 0x4A, 0x1F, 0x4B, 0x1F, + 0x4C, 0x1F, 0x4D, 0x1F, 0x46, 0x1F, 0x47, 0x1F, 0x48, 0x1F, 0x49, 0x1F, 0x4A, 0x1F, 0x4B, 0x1F, + 0x4C, 0x1F, 0x4D, 0x1F, 0x4E, 0x1F, 0x4F, 0x1F, 0x50, 0x1F, 0x59, 0x1F, 0x52, 0x1F, 0x5B, 0x1F, + 0x54, 0x1F, 0x5D, 0x1F, 0x56, 0x1F, 0x5F, 0x1F, 0x58, 0x1F, 0x59, 0x1F, 0x5A, 0x1F, 0x5B, 0x1F, + 0x5C, 0x1F, 0x5D, 0x1F, 0x5E, 0x1F, 0x5F, 0x1F, 0x68, 0x1F, 0x69, 0x1F, 0x6A, 0x1F, 0x6B, 0x1F, + 0x6C, 0x1F, 0x6D, 0x1F, 0x6E, 0x1F, 0x6F, 0x1F, 0x68, 0x1F, 0x69, 0x1F, 0x6A, 0x1F, 0x6B, 0x1F, + 0x6C, 0x1F, 0x6D, 0x1F, 0x6E, 0x1F, 0x6F, 0x1F, 0xBA, 0x1F, 0xBB, 0x1F, 0xC8, 0x1F, 0xC9, 0x1F, + 0xCA, 0x1F, 0xCB, 0x1F, 0xDA, 0x1F, 0xDB, 0x1F, 0xF8, 0x1F, 0xF9, 0x1F, 0xEA, 0x1F, 0xEB, 0x1F, + 0xFA, 0x1F, 0xFB, 0x1F, 0x7E, 0x1F, 0x7F, 0x1F, 0x88, 0x1F, 0x89, 0x1F, 0x8A, 0x1F, 0x8B, 0x1F, + 0x8C, 0x1F, 0x8D, 0x1F, 0x8E, 0x1F, 0x8F, 0x1F, 0x88, 0x1F, 0x89, 0x1F, 0x8A, 0x1F, 0x8B, 0x1F, + 0x8C, 0x1F, 0x8D, 0x1F, 0x8E, 0x1F, 0x8F, 0x1F, 0x98, 0x1F, 0x99, 0x1F, 0x9A, 0x1F, 0x9B, 0x1F, + 0x9C, 0x1F, 0x9D, 0x1F, 0x9E, 0x1F, 0x9F, 0x1F, 0x98, 0x1F, 0x99, 0x1F, 0x9A, 0x1F, 0x9B, 0x1F, + 0x9C, 0x1F, 0x9D, 0x1F, 0x9E, 0x1F, 0x9F, 0x1F, 0xA8, 0x1F, 0xA9, 0x1F, 0xAA, 0x1F, 0xAB, 0x1F, + 0xAC, 0x1F, 0xAD, 0x1F, 0xAE, 0x1F, 0xAF, 0x1F, 0xA8, 0x1F, 0xA9, 0x1F, 0xAA, 0x1F, 0xAB, 0x1F, + 0xAC, 0x1F, 0xAD, 0x1F, 0xAE, 0x1F, 0xAF, 0x1F, 0xB8, 0x1F, 0xB9, 0x1F, 0xB2, 0x1F, 0xBC, 0x1F, + 0xB4, 0x1F, 0xB5, 0x1F, 0xB6, 0x1F, 0xB7, 0x1F, 0xB8, 0x1F, 0xB9, 0x1F, 0xBA, 0x1F, 0xBB, 0x1F, + 0xBC, 0x1F, 0xBD, 0x1F, 0xBE, 0x1F, 0xBF, 0x1F, 0xC0, 0x1F, 0xC1, 0x1F, 0xC2, 0x1F, 0xC3, 0x1F, + 0xC4, 0x1F, 0xC5, 0x1F, 0xC6, 0x1F, 0xC7, 0x1F, 0xC8, 0x1F, 0xC9, 0x1F, 0xCA, 0x1F, 0xCB, 0x1F, + 0xC3, 0x1F, 0xCD, 0x1F, 0xCE, 0x1F, 0xCF, 0x1F, 0xD8, 0x1F, 0xD9, 0x1F, 0xD2, 0x1F, 0xD3, 0x1F, + 0xD4, 0x1F, 0xD5, 0x1F, 0xD6, 0x1F, 0xD7, 0x1F, 0xD8, 0x1F, 0xD9, 0x1F, 0xDA, 0x1F, 0xDB, 0x1F, + 0xDC, 0x1F, 0xDD, 0x1F, 0xDE, 0x1F, 0xDF, 0x1F, 0xE8, 0x1F, 0xE9, 0x1F, 0xE2, 0x1F, 0xE3, 0x1F, + 0xE4, 0x1F, 0xEC, 0x1F, 0xE6, 0x1F, 0xE7, 0x1F, 0xE8, 0x1F, 0xE9, 0x1F, 0xEA, 0x1F, 0xEB, 0x1F, + 0xEC, 0x1F, 0xED, 0x1F, 0xEE, 0x1F, 0xEF, 0x1F, 0xF0, 0x1F, 0xF1, 0x1F, 0xF2, 0x1F, 0xF3, 0x1F, + 0xF4, 0x1F, 0xF5, 0x1F, 0xF6, 0x1F, 0xF7, 0x1F, 0xF8, 0x1F, 0xF9, 0x1F, 0xFA, 0x1F, 0xFB, 0x1F, + 0xF3, 0x1F, 0xFD, 0x1F, 0xFE, 0x1F, 0xFF, 0x1F, 0x00, 0x20, 0x01, 0x20, 0x02, 0x20, 0x03, 0x20, + 0x04, 0x20, 0x05, 0x20, 0x06, 0x20, 0x07, 0x20, 0x08, 0x20, 0x09, 0x20, 0x0A, 0x20, 0x0B, 0x20, + 0x0C, 0x20, 0x0D, 0x20, 0x0E, 0x20, 0x0F, 0x20, 0x10, 0x20, 0x11, 0x20, 0x12, 0x20, 0x13, 0x20, + 0x14, 0x20, 0x15, 0x20, 0x16, 0x20, 0x17, 0x20, 0x18, 0x20, 0x19, 0x20, 0x1A, 0x20, 0x1B, 0x20, + 0x1C, 0x20, 0x1D, 0x20, 0x1E, 0x20, 0x1F, 0x20, 0x20, 0x20, 0x21, 0x20, 0x22, 0x20, 0x23, 0x20, + 0x24, 0x20, 0x25, 0x20, 0x26, 0x20, 0x27, 0x20, 0x28, 0x20, 0x29, 0x20, 0x2A, 0x20, 0x2B, 0x20, + 0x2C, 0x20, 0x2D, 0x20, 0x2E, 0x20, 0x2F, 0x20, 0x30, 0x20, 0x31, 0x20, 0x32, 0x20, 0x33, 0x20, + 0x34, 0x20, 0x35, 0x20, 0x36, 0x20, 0x37, 0x20, 0x38, 0x20, 0x39, 0x20, 0x3A, 0x20, 0x3B, 0x20, + 0x3C, 0x20, 0x3D, 0x20, 0x3E, 0x20, 0x3F, 0x20, 0x40, 0x20, 0x41, 0x20, 0x42, 0x20, 0x43, 0x20, + 0x44, 0x20, 0x45, 0x20, 0x46, 0x20, 0x47, 0x20, 0x48, 0x20, 0x49, 0x20, 0x4A, 0x20, 0x4B, 0x20, + 0x4C, 0x20, 0x4D, 0x20, 0x4E, 0x20, 0x4F, 0x20, 0x50, 0x20, 0x51, 0x20, 0x52, 0x20, 0x53, 0x20, + 0x54, 0x20, 0x55, 0x20, 0x56, 0x20, 0x57, 0x20, 0x58, 0x20, 0x59, 0x20, 0x5A, 0x20, 0x5B, 0x20, + 0x5C, 0x20, 0x5D, 0x20, 0x5E, 0x20, 0x5F, 0x20, 0x60, 0x20, 0x61, 0x20, 0x62, 0x20, 0x63, 0x20, + 0x64, 0x20, 0x65, 0x20, 0x66, 0x20, 0x67, 0x20, 0x68, 0x20, 0x69, 0x20, 0x6A, 0x20, 0x6B, 0x20, + 0x6C, 0x20, 0x6D, 0x20, 0x6E, 0x20, 0x6F, 0x20, 0x70, 0x20, 0x71, 0x20, 0x72, 0x20, 0x73, 0x20, + 0x74, 0x20, 0x75, 0x20, 0x76, 0x20, 0x77, 0x20, 0x78, 0x20, 0x79, 0x20, 0x7A, 0x20, 0x7B, 0x20, + 0x7C, 0x20, 0x7D, 0x20, 0x7E, 0x20, 0x7F, 0x20, 0x80, 0x20, 0x81, 0x20, 0x82, 0x20, 0x83, 0x20, + 0x84, 0x20, 0x85, 0x20, 0x86, 0x20, 0x87, 0x20, 0x88, 0x20, 0x89, 0x20, 0x8A, 0x20, 0x8B, 0x20, + 0x8C, 0x20, 0x8D, 0x20, 0x8E, 0x20, 0x8F, 0x20, 0x90, 0x20, 0x91, 0x20, 0x92, 0x20, 0x93, 0x20, + 0x94, 0x20, 0x95, 0x20, 0x96, 0x20, 0x97, 0x20, 0x98, 0x20, 0x99, 0x20, 0x9A, 0x20, 0x9B, 0x20, + 0x9C, 0x20, 0x9D, 0x20, 0x9E, 0x20, 0x9F, 0x20, 0xA0, 0x20, 0xA1, 0x20, 0xA2, 0x20, 0xA3, 0x20, + 0xA4, 0x20, 0xA5, 0x20, 0xA6, 0x20, 0xA7, 0x20, 0xA8, 0x20, 0xA9, 0x20, 0xAA, 0x20, 0xAB, 0x20, + 0xAC, 0x20, 0xAD, 0x20, 0xAE, 0x20, 0xAF, 0x20, 0xB0, 0x20, 0xB1, 0x20, 0xB2, 0x20, 0xB3, 0x20, + 0xB4, 0x20, 0xB5, 0x20, 0xB6, 0x20, 0xB7, 0x20, 0xB8, 0x20, 0xB9, 0x20, 0xBA, 0x20, 0xBB, 0x20, + 0xBC, 0x20, 0xBD, 0x20, 0xBE, 0x20, 0xBF, 0x20, 0xC0, 0x20, 0xC1, 0x20, 0xC2, 0x20, 0xC3, 0x20, + 0xC4, 0x20, 0xC5, 0x20, 0xC6, 0x20, 0xC7, 0x20, 0xC8, 0x20, 0xC9, 0x20, 0xCA, 0x20, 0xCB, 0x20, + 0xCC, 0x20, 0xCD, 0x20, 0xCE, 0x20, 0xCF, 0x20, 0xD0, 0x20, 0xD1, 0x20, 0xD2, 0x20, 0xD3, 0x20, + 0xD4, 0x20, 0xD5, 0x20, 0xD6, 0x20, 0xD7, 0x20, 0xD8, 0x20, 0xD9, 0x20, 0xDA, 0x20, 0xDB, 0x20, + 0xDC, 0x20, 0xDD, 0x20, 0xDE, 0x20, 0xDF, 0x20, 0xE0, 0x20, 0xE1, 0x20, 0xE2, 0x20, 0xE3, 0x20, + 0xE4, 0x20, 0xE5, 0x20, 0xE6, 0x20, 0xE7, 0x20, 0xE8, 0x20, 0xE9, 0x20, 0xEA, 0x20, 0xEB, 0x20, + 0xEC, 0x20, 0xED, 0x20, 0xEE, 0x20, 0xEF, 0x20, 0xF0, 0x20, 0xF1, 0x20, 0xF2, 0x20, 0xF3, 0x20, + 0xF4, 0x20, 0xF5, 0x20, 0xF6, 0x20, 0xF7, 0x20, 0xF8, 0x20, 0xF9, 0x20, 0xFA, 0x20, 0xFB, 0x20, + 0xFC, 0x20, 0xFD, 0x20, 0xFE, 0x20, 0xFF, 0x20, 0x00, 0x21, 0x01, 0x21, 0x02, 0x21, 0x03, 0x21, + 0x04, 0x21, 0x05, 0x21, 0x06, 0x21, 0x07, 0x21, 0x08, 0x21, 0x09, 0x21, 0x0A, 0x21, 0x0B, 0x21, + 0x0C, 0x21, 0x0D, 0x21, 0x0E, 0x21, 0x0F, 0x21, 0x10, 0x21, 0x11, 0x21, 0x12, 0x21, 0x13, 0x21, + 0x14, 0x21, 0x15, 0x21, 0x16, 0x21, 0x17, 0x21, 0x18, 0x21, 0x19, 0x21, 0x1A, 0x21, 0x1B, 0x21, + 0x1C, 0x21, 0x1D, 0x21, 0x1E, 0x21, 0x1F, 0x21, 0x20, 0x21, 0x21, 0x21, 0x22, 0x21, 0x23, 0x21, + 0x24, 0x21, 0x25, 0x21, 0x26, 0x21, 0x27, 0x21, 0x28, 0x21, 0x29, 0x21, 0x2A, 0x21, 0x2B, 0x21, + 0x2C, 0x21, 0x2D, 0x21, 0x2E, 0x21, 0x2F, 0x21, 0x30, 0x21, 0x31, 0x21, 0x32, 0x21, 0x33, 0x21, + 0x34, 0x21, 0x35, 0x21, 0x36, 0x21, 0x37, 0x21, 0x38, 0x21, 0x39, 0x21, 0x3A, 0x21, 0x3B, 0x21, + 0x3C, 0x21, 0x3D, 0x21, 0x3E, 0x21, 0x3F, 0x21, 0x40, 0x21, 0x41, 0x21, 0x42, 0x21, 0x43, 0x21, + 0x44, 0x21, 0x45, 0x21, 0x46, 0x21, 0x47, 0x21, 0x48, 0x21, 0x49, 0x21, 0x4A, 0x21, 0x4B, 0x21, + 0x4C, 0x21, 0x4D, 0x21, 0x32, 0x21, 0x4F, 0x21, 0x50, 0x21, 0x51, 0x21, 0x52, 0x21, 0x53, 0x21, + 0x54, 0x21, 0x55, 0x21, 0x56, 0x21, 0x57, 0x21, 0x58, 0x21, 0x59, 0x21, 0x5A, 0x21, 0x5B, 0x21, + 0x5C, 0x21, 0x5D, 0x21, 0x5E, 0x21, 0x5F, 0x21, 0x60, 0x21, 0x61, 0x21, 0x62, 0x21, 0x63, 0x21, + 0x64, 0x21, 0x65, 0x21, 0x66, 0x21, 0x67, 0x21, 0x68, 0x21, 0x69, 0x21, 0x6A, 0x21, 0x6B, 0x21, + 0x6C, 0x21, 0x6D, 0x21, 0x6E, 0x21, 0x6F, 0x21, 0x60, 0x21, 0x61, 0x21, 0x62, 0x21, 0x63, 0x21, + 0x64, 0x21, 0x65, 0x21, 0x66, 0x21, 0x67, 0x21, 0x68, 0x21, 0x69, 0x21, 0x6A, 0x21, 0x6B, 0x21, + 0x6C, 0x21, 0x6D, 0x21, 0x6E, 0x21, 0x6F, 0x21, 0x80, 0x21, 0x81, 0x21, 0x82, 0x21, 0x83, 0x21, + 0x83, 0x21, 0xFF, 0xFF, 0x4B, 0x03, 0xB6, 0x24, 0xB7, 0x24, 0xB8, 0x24, 0xB9, 0x24, 0xBA, 0x24, + 0xBB, 0x24, 0xBC, 0x24, 0xBD, 0x24, 0xBE, 0x24, 0xBF, 0x24, 0xC0, 0x24, 0xC1, 0x24, 0xC2, 0x24, + 0xC3, 0x24, 0xC4, 0x24, 0xC5, 0x24, 0xC6, 0x24, 0xC7, 0x24, 0xC8, 0x24, 0xC9, 0x24, 0xCA, 0x24, + 0xCB, 0x24, 0xCC, 0x24, 0xCD, 0x24, 0xCE, 0x24, 0xCF, 0x24, 0xFF, 0xFF, 0x46, 0x07, 0x00, 0x2C, + 0x01, 0x2C, 0x02, 0x2C, 0x03, 0x2C, 0x04, 0x2C, 0x05, 0x2C, 0x06, 0x2C, 0x07, 0x2C, 0x08, 0x2C, + 0x09, 0x2C, 0x0A, 0x2C, 0x0B, 0x2C, 0x0C, 0x2C, 0x0D, 0x2C, 0x0E, 0x2C, 0x0F, 0x2C, 0x10, 0x2C, + 0x11, 0x2C, 0x12, 0x2C, 0x13, 0x2C, 0x14, 0x2C, 0x15, 0x2C, 0x16, 0x2C, 0x17, 0x2C, 0x18, 0x2C, + 0x19, 0x2C, 0x1A, 0x2C, 0x1B, 0x2C, 0x1C, 0x2C, 0x1D, 0x2C, 0x1E, 0x2C, 0x1F, 0x2C, 0x20, 0x2C, + 0x21, 0x2C, 0x22, 0x2C, 0x23, 0x2C, 0x24, 0x2C, 0x25, 0x2C, 0x26, 0x2C, 0x27, 0x2C, 0x28, 0x2C, + 0x29, 0x2C, 0x2A, 0x2C, 0x2B, 0x2C, 0x2C, 0x2C, 0x2D, 0x2C, 0x2E, 0x2C, 0x5F, 0x2C, 0x60, 0x2C, + 0x60, 0x2C, 0x62, 0x2C, 0x63, 0x2C, 0x64, 0x2C, 0x65, 0x2C, 0x66, 0x2C, 0x67, 0x2C, 0x67, 0x2C, + 0x69, 0x2C, 0x69, 0x2C, 0x6B, 0x2C, 0x6B, 0x2C, 0x6D, 0x2C, 0x6E, 0x2C, 0x6F, 0x2C, 0x70, 0x2C, + 0x71, 0x2C, 0x72, 0x2C, 0x73, 0x2C, 0x74, 0x2C, 0x75, 0x2C, 0x75, 0x2C, 0x77, 0x2C, 0x78, 0x2C, + 0x79, 0x2C, 0x7A, 0x2C, 0x7B, 0x2C, 0x7C, 0x2C, 0x7D, 0x2C, 0x7E, 0x2C, 0x7F, 0x2C, 0x80, 0x2C, + 0x80, 0x2C, 0x82, 0x2C, 0x82, 0x2C, 0x84, 0x2C, 0x84, 0x2C, 0x86, 0x2C, 0x86, 0x2C, 0x88, 0x2C, + 0x88, 0x2C, 0x8A, 0x2C, 0x8A, 0x2C, 0x8C, 0x2C, 0x8C, 0x2C, 0x8E, 0x2C, 0x8E, 0x2C, 0x90, 0x2C, + 0x90, 0x2C, 0x92, 0x2C, 0x92, 0x2C, 0x94, 0x2C, 0x94, 0x2C, 0x96, 0x2C, 0x96, 0x2C, 0x98, 0x2C, + 0x98, 0x2C, 0x9A, 0x2C, 0x9A, 0x2C, 0x9C, 0x2C, 0x9C, 0x2C, 0x9E, 0x2C, 0x9E, 0x2C, 0xA0, 0x2C, + 0xA0, 0x2C, 0xA2, 0x2C, 0xA2, 0x2C, 0xA4, 0x2C, 0xA4, 0x2C, 0xA6, 0x2C, 0xA6, 0x2C, 0xA8, 0x2C, + 0xA8, 0x2C, 0xAA, 0x2C, 0xAA, 0x2C, 0xAC, 0x2C, 0xAC, 0x2C, 0xAE, 0x2C, 0xAE, 0x2C, 0xB0, 0x2C, + 0xB0, 0x2C, 0xB2, 0x2C, 0xB2, 0x2C, 0xB4, 0x2C, 0xB4, 0x2C, 0xB6, 0x2C, 0xB6, 0x2C, 0xB8, 0x2C, + 0xB8, 0x2C, 0xBA, 0x2C, 0xBA, 0x2C, 0xBC, 0x2C, 0xBC, 0x2C, 0xBE, 0x2C, 0xBE, 0x2C, 0xC0, 0x2C, + 0xC0, 0x2C, 0xC2, 0x2C, 0xC2, 0x2C, 0xC4, 0x2C, 0xC4, 0x2C, 0xC6, 0x2C, 0xC6, 0x2C, 0xC8, 0x2C, + 0xC8, 0x2C, 0xCA, 0x2C, 0xCA, 0x2C, 0xCC, 0x2C, 0xCC, 0x2C, 0xCE, 0x2C, 0xCE, 0x2C, 0xD0, 0x2C, + 0xD0, 0x2C, 0xD2, 0x2C, 0xD2, 0x2C, 0xD4, 0x2C, 0xD4, 0x2C, 0xD6, 0x2C, 0xD6, 0x2C, 0xD8, 0x2C, + 0xD8, 0x2C, 0xDA, 0x2C, 0xDA, 0x2C, 0xDC, 0x2C, 0xDC, 0x2C, 0xDE, 0x2C, 0xDE, 0x2C, 0xE0, 0x2C, + 0xE0, 0x2C, 0xE2, 0x2C, 0xE2, 0x2C, 0xE4, 0x2C, 0xE5, 0x2C, 0xE6, 0x2C, 0xE7, 0x2C, 0xE8, 0x2C, + 0xE9, 0x2C, 0xEA, 0x2C, 0xEB, 0x2C, 0xEC, 0x2C, 0xED, 0x2C, 0xEE, 0x2C, 0xEF, 0x2C, 0xF0, 0x2C, + 0xF1, 0x2C, 0xF2, 0x2C, 0xF3, 0x2C, 0xF4, 0x2C, 0xF5, 0x2C, 0xF6, 0x2C, 0xF7, 0x2C, 0xF8, 0x2C, + 0xF9, 0x2C, 0xFA, 0x2C, 0xFB, 0x2C, 0xFC, 0x2C, 0xFD, 0x2C, 0xFE, 0x2C, 0xFF, 0x2C, 0xA0, 0x10, + 0xA1, 0x10, 0xA2, 0x10, 0xA3, 0x10, 0xA4, 0x10, 0xA5, 0x10, 0xA6, 0x10, 0xA7, 0x10, 0xA8, 0x10, + 0xA9, 0x10, 0xAA, 0x10, 0xAB, 0x10, 0xAC, 0x10, 0xAD, 0x10, 0xAE, 0x10, 0xAF, 0x10, 0xB0, 0x10, + 0xB1, 0x10, 0xB2, 0x10, 0xB3, 0x10, 0xB4, 0x10, 0xB5, 0x10, 0xB6, 0x10, 0xB7, 0x10, 0xB8, 0x10, + 0xB9, 0x10, 0xBA, 0x10, 0xBB, 0x10, 0xBC, 0x10, 0xBD, 0x10, 0xBE, 0x10, 0xBF, 0x10, 0xC0, 0x10, + 0xC1, 0x10, 0xC2, 0x10, 0xC3, 0x10, 0xC4, 0x10, 0xC5, 0x10, 0xFF, 0xFF, 0x1B, 0xD2, 0x21, 0xFF, + 0x22, 0xFF, 0x23, 0xFF, 0x24, 0xFF, 0x25, 0xFF, 0x26, 0xFF, 0x27, 0xFF, 0x28, 0xFF, 0x29, 0xFF, + 0x2A, 0xFF, 0x2B, 0xFF, 0x2C, 0xFF, 0x2D, 0xFF, 0x2E, 0xFF, 0x2F, 0xFF, 0x30, 0xFF, 0x31, 0xFF, + 0x32, 0xFF, 0x33, 0xFF, 0x34, 0xFF, 0x35, 0xFF, 0x36, 0xFF, 0x37, 0xFF, 0x38, 0xFF, 0x39, 0xFF, + 0x3A, 0xFF, 0x5B, 0xFF, 0x5C, 0xFF, 0x5D, 0xFF, 0x5E, 0xFF, 0x5F, 0xFF, 0x60, 0xFF, 0x61, 0xFF, + 0x62, 0xFF, 0x63, 0xFF, 0x64, 0xFF, 0x65, 0xFF, 0x66, 0xFF, 0x67, 0xFF, 0x68, 0xFF, 0x69, 0xFF, + 0x6A, 0xFF, 0x6B, 0xFF, 0x6C, 0xFF, 0x6D, 0xFF, 0x6E, 0xFF, 0x6F, 0xFF, 0x70, 0xFF, 0x71, 0xFF, + 0x72, 0xFF, 0x73, 0xFF, 0x74, 0xFF, 0x75, 0xFF, 0x76, 0xFF, 0x77, 0xFF, 0x78, 0xFF, 0x79, 0xFF, + 0x7A, 0xFF, 0x7B, 0xFF, 0x7C, 0xFF, 0x7D, 0xFF, 0x7E, 0xFF, 0x7F, 0xFF, 0x80, 0xFF, 0x81, 0xFF, + 0x82, 0xFF, 0x83, 0xFF, 0x84, 0xFF, 0x85, 0xFF, 0x86, 0xFF, 0x87, 0xFF, 0x88, 0xFF, 0x89, 0xFF, + 0x8A, 0xFF, 0x8B, 0xFF, 0x8C, 0xFF, 0x8D, 0xFF, 0x8E, 0xFF, 0x8F, 0xFF, 0x90, 0xFF, 0x91, 0xFF, + 0x92, 0xFF, 0x93, 0xFF, 0x94, 0xFF, 0x95, 0xFF, 0x96, 0xFF, 0x97, 0xFF, 0x98, 0xFF, 0x99, 0xFF, + 0x9A, 0xFF, 0x9B, 0xFF, 0x9C, 0xFF, 0x9D, 0xFF, 0x9E, 0xFF, 0x9F, 0xFF, 0xA0, 0xFF, 0xA1, 0xFF, + 0xA2, 0xFF, 0xA3, 0xFF, 0xA4, 0xFF, 0xA5, 0xFF, 0xA6, 0xFF, 0xA7, 0xFF, 0xA8, 0xFF, 0xA9, 0xFF, + 0xAA, 0xFF, 0xAB, 0xFF, 0xAC, 0xFF, 0xAD, 0xFF, 0xAE, 0xFF, 0xAF, 0xFF, 0xB0, 0xFF, 0xB1, 0xFF, + 0xB2, 0xFF, 0xB3, 0xFF, 0xB4, 0xFF, 0xB5, 0xFF, 0xB6, 0xFF, 0xB7, 0xFF, 0xB8, 0xFF, 0xB9, 0xFF, + 0xBA, 0xFF, 0xBB, 0xFF, 0xBC, 0xFF, 0xBD, 0xFF, 0xBE, 0xFF, 0xBF, 0xFF, 0xC0, 0xFF, 0xC1, 0xFF, + 0xC2, 0xFF, 0xC3, 0xFF, 0xC4, 0xFF, 0xC5, 0xFF, 0xC6, 0xFF, 0xC7, 0xFF, 0xC8, 0xFF, 0xC9, 0xFF, + 0xCA, 0xFF, 0xCB, 0xFF, 0xCC, 0xFF, 0xCD, 0xFF, 0xCE, 0xFF, 0xCF, 0xFF, 0xD0, 0xFF, 0xD1, 0xFF, + 0xD2, 0xFF, 0xD3, 0xFF, 0xD4, 0xFF, 0xD5, 0xFF, 0xD6, 0xFF, 0xD7, 0xFF, 0xD8, 0xFF, 0xD9, 0xFF, + 0xDA, 0xFF, 0xDB, 0xFF, 0xDC, 0xFF, 0xDD, 0xFF, 0xDE, 0xFF, 0xDF, 0xFF, 0xE0, 0xFF, 0xE1, 0xFF, + 0xE2, 0xFF, 0xE3, 0xFF, 0xE4, 0xFF, 0xE5, 0xFF, 0xE6, 0xFF, 0xE7, 0xFF, 0xE8, 0xFF, 0xE9, 0xFF, + 0xEA, 0xFF, 0xEB, 0xFF, 0xEC, 0xFF, 0xED, 0xFF, 0xEE, 0xFF, 0xEF, 0xFF, 0xF0, 0xFF, 0xF1, 0xFF, + 0xF2, 0xFF, 0xF3, 0xFF, 0xF4, 0xFF, 0xF5, 0xFF, 0xF6, 0xFF, 0xF7, 0xFF, 0xF8, 0xFF, 0xF9, 0xFF, + 0xFA, 0xFF, 0xFB, 0xFF, 0xFC, 0xFF, 0xFD, 0xFF, 0xFE, 0xFF, 0xFF, 0xFF +}; diff --git b/fs/exfat/exfat_version.h b/fs/exfat/exfat_version.h new file mode 100644 index 0000000..a93fa46 --- /dev/null +++ b/fs/exfat/exfat_version.h @@ -0,0 +1,19 @@ +/************************************************************************/ +/* */ +/* PROJECT : exFAT & FAT12/16/32 File System */ +/* FILE : exfat_version.h */ +/* PURPOSE : exFAT File Manager */ +/* */ +/*----------------------------------------------------------------------*/ +/* NOTES */ +/* */ +/*----------------------------------------------------------------------*/ +/* REVISION HISTORY */ +/* */ +/* - 2012.02.10 : Release Version 1.1.0 */ +/* - 2012.04.02 : P1 : Change Module License to Samsung Proprietary */ +/* - 2012.06.07 : P2 : Fixed incorrect filename problem */ +/* */ +/************************************************************************/ + +#define EXFAT_VERSION "1.2.9" diff --git a/fs/open.c b/fs/open.c index 55bdc75..16f561e 100644 --- a/fs/open.c +++ b/fs/open.c @@ -34,6 +34,9 @@ #include "internal.h" +#define CREATE_TRACE_POINTS +#include + int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs, struct file *filp) { @@ -1026,6 +1029,7 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) } else { fsnotify_open(f); fd_install(fd, f); + trace_do_sys_open(tmp->name, flags, mode); } } putname(tmp); diff --git a/fs/super.c b/fs/super.c index 74914b1..ad57841 100644 --- a/fs/super.c +++ b/fs/super.c @@ -36,7 +36,7 @@ #include "internal.h" -static LIST_HEAD(super_blocks); +LIST_HEAD(super_blocks); static DEFINE_SPINLOCK(sb_lock); static char *sb_writers_name[SB_FREEZE_LEVELS] = { diff --git a/include/linux/bio.h b/include/linux/bio.h index 88bc64f..b721a9d 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -32,6 +32,8 @@ /* struct bio, bio_vec and BIO_* flags are defined in blk_types.h */ #include +extern int trap_non_toi_io; + #define BIO_DEBUG #ifdef BIO_DEBUG diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 86a38ea..fee9e23 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -120,13 +120,14 @@ struct bio { #define BIO_QUIET 6 /* Make BIO Quiet */ #define BIO_CHAIN 7 /* chained bio, ->bi_remaining in effect */ #define BIO_REFFED 8 /* bio has elevated ->bi_cnt */ +#define BIO_TOI 9 /* bio is TuxOnIce submitted */ /* * Flags starting here get preserved by bio_reset() - this includes * BIO_POOL_IDX() */ -#define BIO_RESET_BITS 13 -#define BIO_OWNS_VEC 13 /* bio_free() should free bvec */ +#define BIO_RESET_BITS 14 +#define BIO_OWNS_VEC 14 /* bio_free() should free bvec */ /* * top 4 bits of bio flags indicate the pool this bio came from diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 413c84f..8fec976 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -39,13 +39,17 @@ struct blk_flush_queue; struct pr_ops; #define BLKDEV_MIN_RQ 4 +#ifdef CONFIG_PCK_INTERACTIVE +#define BLKDEV_MAX_RQ 16 +#else #define BLKDEV_MAX_RQ 128 /* Default maximum */ +#endif /* * Maximum number of blkcg policies allowed to be registered concurrently. * Defined here to simplify include dependency. */ -#define BLKCG_MAX_POLS 2 +#define BLKCG_MAX_POLS 3 struct request; typedef void (rq_end_io_fn)(struct request *, int); diff --git a/include/linux/console.h b/include/linux/console.h index ea731af..3a61cf7 100644 --- a/include/linux/console.h +++ b/include/linux/console.h @@ -119,7 +119,7 @@ static inline int con_debug_leave(void) struct console { char name[16]; - void (*write)(struct console *, const char *, unsigned); + void (*write)(struct console *, const char *, unsigned, unsigned int); int (*read)(struct console *, char *, unsigned); struct tty_driver *(*device)(struct console *, int *); void (*unblank)(void); diff --git a/include/linux/fs.h b/include/linux/fs.h index ae68100..cf958ec 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1278,6 +1278,8 @@ struct mm_struct; #define UMOUNT_NOFOLLOW 0x00000008 /* Don't follow symlink on umount */ #define UMOUNT_UNUSED 0x80000000 /* Flag guaranteed to be unused */ +extern struct list_head super_blocks; + /* sb->s_iflags */ #define SB_I_CGROUPWB 0x00000001 /* cgroup-aware writeback enabled */ #define SB_I_NOEXEC 0x00000002 /* Ignore executables on this fs */ @@ -1774,6 +1776,8 @@ struct super_operations { #else #define S_DAX 0 /* Make all the DAX code disappear */ #endif +#define S_ATOMIC_COPY 16384 /* Pages mapped with this inode need to be + atomically copied (gem) */ /* * Note that nosuid etc flags are inode-specific: setting some file-system @@ -2306,6 +2310,13 @@ extern struct super_block *freeze_bdev(struct block_device *); extern void emergency_thaw_all(void); extern int thaw_bdev(struct block_device *bdev, struct super_block *sb); extern int fsync_bdev(struct block_device *); +extern int fsync_super(struct super_block *); +extern int fsync_no_super(struct block_device *); +#define FS_FREEZER_FUSE 1 +#define FS_FREEZER_NORMAL 2 +#define FS_FREEZER_ALL (FS_FREEZER_FUSE | FS_FREEZER_NORMAL) +void freeze_filesystems(int which); +void thaw_filesystems(int which); #ifdef CONFIG_FS_DAX extern bool blkdev_dax_capable(struct block_device *bdev); #else diff --git b/include/linux/fs_uuid.h b/include/linux/fs_uuid.h new file mode 100644 index 0000000..3234135 --- /dev/null +++ b/include/linux/fs_uuid.h @@ -0,0 +1,19 @@ +#include + +struct hd_struct; +struct block_device; + +struct fs_info { + char uuid[16]; + dev_t dev_t; + char *last_mount; + int last_mount_size; +}; + +int part_matches_fs_info(struct hd_struct *part, struct fs_info *seek); +dev_t blk_lookup_fs_info(struct fs_info *seek); +struct fs_info *fs_info_from_block_dev(struct block_device *bdev); +void free_fs_info(struct fs_info *fs_info); +int bdev_matches_key(struct block_device *bdev, const char *key); +struct block_device *next_bdev_of_type(struct block_device *last, + const char *key); diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h index c2b340e..6d9df3f 100644 --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -713,6 +713,18 @@ static inline void __ftrace_enabled_restore(int enabled) #define CALLER_ADDR5 ((unsigned long)ftrace_return_address(5)) #define CALLER_ADDR6 ((unsigned long)ftrace_return_address(6)) +static inline unsigned long get_lock_parent_ip(void) +{ + unsigned long addr = CALLER_ADDR0; + + if (!in_lock_functions(addr)) + return addr; + addr = CALLER_ADDR1; + if (!in_lock_functions(addr)) + return addr; + return CALLER_ADDR2; +} + #ifdef CONFIG_IRQSOFF_TRACER extern void time_hardirqs_on(unsigned long a0, unsigned long a1); extern void time_hardirqs_off(unsigned long a0, unsigned long a1); diff --git a/include/linux/gfp.h b/include/linux/gfp.h index af1f2b2..0b6edf7 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -35,7 +35,8 @@ struct vm_area_struct; #define ___GFP_DIRECT_RECLAIM 0x400000u #define ___GFP_OTHER_NODE 0x800000u #define ___GFP_WRITE 0x1000000u -#define ___GFP_KSWAPD_RECLAIM 0x2000000u +#define ___GFP_TOI_NOTRACK 0x2000000u +#define ___GFP_KSWAPD_RECLAIM 0x4000000u /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -153,6 +154,7 @@ struct vm_area_struct; #define __GFP_REPEAT ((__force gfp_t)___GFP_REPEAT) #define __GFP_NOFAIL ((__force gfp_t)___GFP_NOFAIL) #define __GFP_NORETRY ((__force gfp_t)___GFP_NORETRY) +#define __GFP_TOI_NOTRACK ((__force gfp_t)___GFP_TOI_NOTRACK) /* Allocator wants page untracked by TOI */ /* * Action modifiers @@ -186,7 +188,7 @@ struct vm_area_struct; #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* Room for N __GFP_FOO bits */ -#define __GFP_BITS_SHIFT 26 +#define __GFP_BITS_SHIFT 27 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /* diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 861f690..5276fe0 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -25,6 +25,7 @@ #include #include #include +#include #include #include @@ -218,7 +219,7 @@ struct kvm_vcpu { int fpu_active; int guest_fpu_loaded, guest_xcr0_loaded; unsigned char fpu_counter; - wait_queue_head_t wq; + struct swait_queue_head wq; struct pid *pid; int sigset_active; sigset_t sigset; @@ -782,7 +783,7 @@ static inline bool kvm_arch_has_assigned_device(struct kvm *kvm) } #endif -static inline wait_queue_head_t *kvm_arch_vcpu_wq(struct kvm_vcpu *vcpu) +static inline struct swait_queue_head *kvm_arch_vcpu_wq(struct kvm_vcpu *vcpu) { #ifdef __KVM_HAVE_ARCH_WQP return vcpu->arch.wqp; diff --git a/include/linux/latencytop.h b/include/linux/latencytop.h index e23121f..59ccab2 100644 --- a/include/linux/latencytop.h +++ b/include/linux/latencytop.h @@ -37,6 +37,9 @@ account_scheduler_latency(struct task_struct *task, int usecs, int inter) void clear_all_latency_tracing(struct task_struct *p); +extern int sysctl_latencytop(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, loff_t *ppos); + #else static inline void diff --git a/include/linux/linux_logo.h b/include/linux/linux_logo.h index ca5bd91..5489dcb 100644 --- a/include/linux/linux_logo.h +++ b/include/linux/linux_logo.h @@ -37,6 +37,18 @@ extern const struct linux_logo logo_linux_vga16; extern const struct linux_logo logo_linux_clut224; extern const struct linux_logo logo_blackfin_vga16; extern const struct linux_logo logo_blackfin_clut224; +extern const struct linux_logo logo_zen_clut224; +extern const struct linux_logo logo_oldzen_clut224; +extern const struct linux_logo logo_arch_clut224; +extern const struct linux_logo logo_gentoo_clut224; +extern const struct linux_logo logo_exherbo_clut224; +extern const struct linux_logo logo_slackware_clut224; +extern const struct linux_logo logo_debian_clut224; +extern const struct linux_logo logo_fedorasimple_clut224; +extern const struct linux_logo logo_fedoraglossy_clut224; +extern const struct linux_logo logo_tits_clut224; +extern const struct linux_logo logo_bsd_clut224; +extern const struct linux_logo logo_fbsd_clut224; extern const struct linux_logo logo_dec_clut224; extern const struct linux_logo logo_mac_clut224; extern const struct linux_logo logo_parisc_clut224; diff --git a/include/linux/mm.h b/include/linux/mm.h index 516e149..ccc992c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2225,6 +2225,7 @@ int drop_caches_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); #endif +void drop_pagecache(void); void drop_slab(void); void drop_slab_node(int nid); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 19724e6..2e7ecb7 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -101,6 +101,12 @@ enum pageflags { #ifdef CONFIG_MEMORY_FAILURE PG_hwpoison, /* hardware poisoned page. Don't touch */ #endif +#ifdef CONFIG_TOI_INCREMENTAL + PG_toi_untracked, /* Don't track dirtiness of this page - assume always dirty */ + PG_toi_ro, /* Page was made RO by TOI */ + PG_toi_cbw, /* Copy the page before it is written to */ + PG_toi_dirty, /* Page has been modified */ +#endif #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT) PG_young, PG_idle, @@ -343,6 +349,17 @@ TESTSCFLAG(HWPoison, hwpoison, PF_ANY) PAGEFLAG_FALSE(HWPoison) #define __PG_HWPOISON 0 #endif +#ifdef CONFIG_TOI_INCREMENTAL +PAGEFLAG(TOI_RO, toi_ro) +PAGEFLAG(TOI_Dirty, toi_dirty) +PAGEFLAG(TOI_Untracked, toi_untracked) +PAGEFLAG(TOI_CBW, toi_cbw) +#else +PAGEFLAG_FALSE(TOI_RO) +PAGEFLAG_FALSE(TOI_Dirty) +PAGEFLAG_FALSE(TOI_Untracked) +PAGEFLAG_FALSE(TOI_CBW) +#endif #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT) TESTPAGEFLAG(Young, young, PF_ANY) @@ -665,8 +682,12 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) * __PG_HWPOISON is exceptional because it needs to be kept beyond page's * alloc-free cycle to prevent from reusing the page. */ -#define PAGE_FLAGS_CHECK_AT_PREP \ - (((1 << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON) +#ifdef CONFIG_TOI_INCREMENTAL +#define PAGE_FLAGS_CHECK_AT_PREP (((1 << NR_PAGEFLAGS) - 1) & \ + ~((1 << PG_toi_dirty) | (1 << PG_toi_ro) | ~__PG_HWPOISON)) +#else +#define PAGE_FLAGS_CHECK_AT_PREP (((1 << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON) +#endif #define PAGE_FLAGS_PRIVATE \ (1 << PG_private | 1 << PG_private_2) diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index c9e4731..4f080ab 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -95,4 +95,8 @@ extern int sysctl_numa_balancing(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); +extern int sysctl_schedstats(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos); + #endif /* _SCHED_SYSCTL_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index a10494a..838a89a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -182,8 +182,6 @@ extern void update_cpu_load_nohz(int active); static inline void update_cpu_load_nohz(int active) { } #endif -extern unsigned long get_parent_ip(unsigned long addr); - extern void dump_cpu_task(int cpu); struct seq_file; @@ -920,6 +918,10 @@ static inline int sched_info_on(void) #endif } +#ifdef CONFIG_SCHEDSTATS +void force_schedstat_enabled(void); +#endif + enum cpu_idle_type { CPU_IDLE, CPU_NOT_IDLE, @@ -1289,6 +1291,8 @@ struct sched_rt_entity { unsigned long timeout; unsigned long watchdog_stamp; unsigned int time_slice; + unsigned short on_rq; + unsigned short on_list; struct sched_rt_entity *back; #ifdef CONFIG_RT_GROUP_SCHED @@ -1329,10 +1333,6 @@ struct sched_dl_entity { * task has to wait for a replenishment to be performed at the * next firing of dl_timer. * - * @dl_new tells if a new instance arrived. If so we must - * start executing it with full runtime and reset its absolute - * deadline; - * * @dl_boosted tells if we are boosted due to DI. If so we are * outside bandwidth enforcement mechanism (but only until we * exit the critical section); @@ -1340,7 +1340,7 @@ struct sched_dl_entity { * @dl_yielded tells if task gave up the cpu before consuming * all its available runtime during the last job. */ - int dl_throttled, dl_new, dl_boosted, dl_yielded; + int dl_throttled, dl_boosted, dl_yielded; /* * Bandwidth enforcement timer. Each -deadline task has its diff --git a/include/linux/serial_8250.h b/include/linux/serial_8250.h index faa0e03..0d4ef69 100644 --- a/include/linux/serial_8250.h +++ b/include/linux/serial_8250.h @@ -153,7 +153,7 @@ unsigned int serial8250_modem_status(struct uart_8250_port *up); void serial8250_init_port(struct uart_8250_port *up); void serial8250_set_defaults(struct uart_8250_port *up); void serial8250_console_write(struct uart_8250_port *up, const char *s, - unsigned int count); + unsigned int count, unsigned int loglevel); int serial8250_console_setup(struct uart_port *port, char *options, bool probe); extern void serial8250_set_isa_configurator(void (*v) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 4d4780c..665cd85 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -45,9 +45,10 @@ static inline struct shmem_inode_info *SHMEM_I(struct inode *inode) extern int shmem_init(void); extern int shmem_fill_super(struct super_block *sb, void *data, int silent); extern struct file *shmem_file_setup(const char *name, - loff_t size, unsigned long flags); + loff_t size, unsigned long flags, + int atomic_copy); extern struct file *shmem_kernel_file_setup(const char *name, loff_t size, - unsigned long flags); + unsigned long flags, int atomic_copy); extern int shmem_zero_setup(struct vm_area_struct *); extern int shmem_lock(struct file *file, int lock, struct user_struct *user); extern bool shmem_mapping(struct address_space *mapping); diff --git a/include/linux/suspend.h b/include/linux/suspend.h index 8b6ec7e..0e527a7 100644 --- a/include/linux/suspend.h +++ b/include/linux/suspend.h @@ -491,6 +491,74 @@ extern bool pm_print_times_enabled; #define pm_print_times_enabled (false) #endif +enum { + TOI_CAN_HIBERNATE, + TOI_CAN_RESUME, + TOI_RESUME_DEVICE_OK, + TOI_NORESUME_SPECIFIED, + TOI_SANITY_CHECK_PROMPT, + TOI_CONTINUE_REQ, + TOI_RESUMED_BEFORE, + TOI_BOOT_TIME, + TOI_NOW_RESUMING, + TOI_IGNORE_LOGLEVEL, + TOI_TRYING_TO_RESUME, + TOI_LOADING_ALT_IMAGE, + TOI_STOP_RESUME, + TOI_IO_STOPPED, + TOI_NOTIFIERS_PREPARE, + TOI_CLUSTER_MODE, + TOI_BOOT_KERNEL, + TOI_DEVICE_HOTPLUG_LOCKED, +}; + +#ifdef CONFIG_TOI + +/* Used in init dir files */ +extern unsigned long toi_state; +#define set_toi_state(bit) (set_bit(bit, &toi_state)) +#define clear_toi_state(bit) (clear_bit(bit, &toi_state)) +#define test_toi_state(bit) (test_bit(bit, &toi_state)) +extern int toi_running; + +#define test_action_state(bit) (test_bit(bit, &toi_bkd.toi_action)) +extern int try_tuxonice_hibernate(void); + +#else /* !CONFIG_TOI */ + +#define toi_state (0) +#define set_toi_state(bit) do { } while (0) +#define clear_toi_state(bit) do { } while (0) +#define test_toi_state(bit) (0) +#define toi_running (0) + +static inline int try_tuxonice_hibernate(void) { return 0; } +#define test_action_state(bit) (0) + +#endif /* CONFIG_TOI */ + +#ifdef CONFIG_HIBERNATION +#ifdef CONFIG_TOI +extern void try_tuxonice_resume(void); +#else +#define try_tuxonice_resume() do { } while (0) +#endif + +extern int resume_attempted; +extern int software_resume(void); + +static inline void check_resume_attempted(void) +{ + if (resume_attempted) + return; + + software_resume(); +} +#else +#define check_resume_attempted() do { } while (0) +#define resume_attempted (0) +#endif + #ifdef CONFIG_PM_AUTOSLEEP /* kernel/power/autosleep.c */ diff --git b/include/linux/swait.h b/include/linux/swait.h new file mode 100644 index 0000000..c1f9c62 --- /dev/null +++ b/include/linux/swait.h @@ -0,0 +1,172 @@ +#ifndef _LINUX_SWAIT_H +#define _LINUX_SWAIT_H + +#include +#include +#include +#include + +/* + * Simple wait queues + * + * While these are very similar to the other/complex wait queues (wait.h) the + * most important difference is that the simple waitqueue allows for + * deterministic behaviour -- IOW it has strictly bounded IRQ and lock hold + * times. + * + * In order to make this so, we had to drop a fair number of features of the + * other waitqueue code; notably: + * + * - mixing INTERRUPTIBLE and UNINTERRUPTIBLE sleeps on the same waitqueue; + * all wakeups are TASK_NORMAL in order to avoid O(n) lookups for the right + * sleeper state. + * + * - the exclusive mode; because this requires preserving the list order + * and this is hard. + * + * - custom wake functions; because you cannot give any guarantees about + * random code. + * + * As a side effect of this; the data structures are slimmer. + * + * One would recommend using this wait queue where possible. + */ + +struct task_struct; + +struct swait_queue_head { + raw_spinlock_t lock; + struct list_head task_list; +}; + +struct swait_queue { + struct task_struct *task; + struct list_head task_list; +}; + +#define __SWAITQUEUE_INITIALIZER(name) { \ + .task = current, \ + .task_list = LIST_HEAD_INIT((name).task_list), \ +} + +#define DECLARE_SWAITQUEUE(name) \ + struct swait_queue name = __SWAITQUEUE_INITIALIZER(name) + +#define __SWAIT_QUEUE_HEAD_INITIALIZER(name) { \ + .lock = __RAW_SPIN_LOCK_UNLOCKED(name.lock), \ + .task_list = LIST_HEAD_INIT((name).task_list), \ +} + +#define DECLARE_SWAIT_QUEUE_HEAD(name) \ + struct swait_queue_head name = __SWAIT_QUEUE_HEAD_INITIALIZER(name) + +extern void __init_swait_queue_head(struct swait_queue_head *q, const char *name, + struct lock_class_key *key); + +#define init_swait_queue_head(q) \ + do { \ + static struct lock_class_key __key; \ + __init_swait_queue_head((q), #q, &__key); \ + } while (0) + +#ifdef CONFIG_LOCKDEP +# define __SWAIT_QUEUE_HEAD_INIT_ONSTACK(name) \ + ({ init_swait_queue_head(&name); name; }) +# define DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(name) \ + struct swait_queue_head name = __SWAIT_QUEUE_HEAD_INIT_ONSTACK(name) +#else +# define DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(name) \ + DECLARE_SWAIT_QUEUE_HEAD(name) +#endif + +static inline int swait_active(struct swait_queue_head *q) +{ + return !list_empty(&q->task_list); +} + +extern void swake_up(struct swait_queue_head *q); +extern void swake_up_all(struct swait_queue_head *q); +extern void swake_up_locked(struct swait_queue_head *q); + +extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait); +extern void prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait, int state); +extern long prepare_to_swait_event(struct swait_queue_head *q, struct swait_queue *wait, int state); + +extern void __finish_swait(struct swait_queue_head *q, struct swait_queue *wait); +extern void finish_swait(struct swait_queue_head *q, struct swait_queue *wait); + +/* as per ___wait_event() but for swait, therefore "exclusive == 0" */ +#define ___swait_event(wq, condition, state, ret, cmd) \ +({ \ + struct swait_queue __wait; \ + long __ret = ret; \ + \ + INIT_LIST_HEAD(&__wait.task_list); \ + for (;;) { \ + long __int = prepare_to_swait_event(&wq, &__wait, state);\ + \ + if (condition) \ + break; \ + \ + if (___wait_is_interruptible(state) && __int) { \ + __ret = __int; \ + break; \ + } \ + \ + cmd; \ + } \ + finish_swait(&wq, &__wait); \ + __ret; \ +}) + +#define __swait_event(wq, condition) \ + (void)___swait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, \ + schedule()) + +#define swait_event(wq, condition) \ +do { \ + if (condition) \ + break; \ + __swait_event(wq, condition); \ +} while (0) + +#define __swait_event_timeout(wq, condition, timeout) \ + ___swait_event(wq, ___wait_cond_timeout(condition), \ + TASK_UNINTERRUPTIBLE, timeout, \ + __ret = schedule_timeout(__ret)) + +#define swait_event_timeout(wq, condition, timeout) \ +({ \ + long __ret = timeout; \ + if (!___wait_cond_timeout(condition)) \ + __ret = __swait_event_timeout(wq, condition, timeout); \ + __ret; \ +}) + +#define __swait_event_interruptible(wq, condition) \ + ___swait_event(wq, condition, TASK_INTERRUPTIBLE, 0, \ + schedule()) + +#define swait_event_interruptible(wq, condition) \ +({ \ + int __ret = 0; \ + if (!(condition)) \ + __ret = __swait_event_interruptible(wq, condition); \ + __ret; \ +}) + +#define __swait_event_interruptible_timeout(wq, condition, timeout) \ + ___swait_event(wq, ___wait_cond_timeout(condition), \ + TASK_INTERRUPTIBLE, timeout, \ + __ret = schedule_timeout(__ret)) + +#define swait_event_interruptible_timeout(wq, condition, timeout) \ +({ \ + long __ret = timeout; \ + if (!___wait_cond_timeout(condition)) \ + __ret = __swait_event_interruptible_timeout(wq, \ + condition, timeout); \ + __ret; \ +}) + +#endif /* _LINUX_SWAIT_H */ diff --git a/include/linux/swap.h b/include/linux/swap.h index d18b65c..44324f1 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -288,6 +288,7 @@ static inline void workingset_node_shadows_dec(struct radix_tree_node *node) extern unsigned long totalram_pages; extern unsigned long totalreserve_pages; extern unsigned long nr_free_buffer_pages(void); +extern unsigned long nr_unallocated_buffer_pages(void); extern unsigned long nr_free_pagecache_pages(void); /* Definition of global_page_state not available yet */ @@ -328,6 +329,8 @@ extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, struct zone *zone, unsigned long *nr_scanned); extern unsigned long shrink_all_memory(unsigned long nr_pages); +extern unsigned long shrink_memory_mask(unsigned long nr_to_reclaim, + gfp_t mask); extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern unsigned long vm_total_pages; @@ -413,14 +416,18 @@ extern void swapcache_free(swp_entry_t); extern int free_swap_and_cache(swp_entry_t); extern int swap_type_of(dev_t, sector_t, struct block_device **); extern unsigned int count_swap_pages(int, int); +extern sector_t map_swap_entry(swp_entry_t entry, struct block_device **); extern sector_t map_swap_page(struct page *, struct block_device **); extern sector_t swapdev_block(int, pgoff_t); +extern struct swap_info_struct *get_swap_info_struct(unsigned); extern int page_swapcount(struct page *); extern int swp_swapcount(swp_entry_t entry); extern struct swap_info_struct *page_swap_info(struct page *); extern int reuse_swap_page(struct page *); extern int try_to_free_swap(struct page *); struct backing_dev_info; +extern void get_swap_range_of_type(int type, swp_entry_t *start, + swp_entry_t *end, unsigned int limit); #else /* CONFIG_SWAP */ diff --git a/include/linux/tcp.h b/include/linux/tcp.h index b386361..4e92c0f 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -19,6 +19,7 @@ #include +#include #include #include #include @@ -336,6 +337,21 @@ struct tcp_sock { struct tcp_md5sig_info __rcu *md5sig_info; #endif +#ifdef CONFIG_TCP_STEALTH +/* Stealth TCP socket configuration */ + struct { + #define TCP_STEALTH_MODE_AUTH BIT(0) + #define TCP_STEALTH_MODE_INTEGRITY BIT(1) + #define TCP_STEALTH_MODE_INTEGRITY_LEN BIT(2) + u8 mode; + u8 secret[MD5_MESSAGE_BYTES]; + u16 integrity_hash; + size_t integrity_len; + struct skb_mstamp mstamp; + bool saw_tsval; + } stealth; +#endif + /* TCP fastopen related information */ struct tcp_fastopen_request *fastopen_req; /* fastopen_rsk points to request_sock that resulted in this big diff --git b/include/linux/thinkpad_ec.h b/include/linux/thinkpad_ec.h new file mode 100644 index 0000000..1b80d7e --- /dev/null +++ b/include/linux/thinkpad_ec.h @@ -0,0 +1,47 @@ +/* + * thinkpad_ec.h - interface to ThinkPad embedded controller LPC3 functions + * + * Copyright (C) 2005 Shem Multinymous + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef _THINKPAD_EC_H +#define _THINKPAD_EC_H + +#ifdef __KERNEL__ + +#define TP_CONTROLLER_ROW_LEN 16 + +/* EC transactions input and output (possibly partial) vectors of 16 bytes. */ +struct thinkpad_ec_row { + u16 mask; /* bitmap of which entries of val[] are meaningful */ + u8 val[TP_CONTROLLER_ROW_LEN]; +}; + +extern int __must_check thinkpad_ec_lock(void); +extern int __must_check thinkpad_ec_try_lock(void); +extern void thinkpad_ec_unlock(void); + +extern int thinkpad_ec_read_row(const struct thinkpad_ec_row *args, + struct thinkpad_ec_row *data); +extern int thinkpad_ec_try_read_row(const struct thinkpad_ec_row *args, + struct thinkpad_ec_row *mask); +extern int thinkpad_ec_prefetch_row(const struct thinkpad_ec_row *args); +extern void thinkpad_ec_invalidate(void); + + +#endif /* __KERNEL */ +#endif /* _THINKPAD_EC_H */ diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h index b4c2a48..c54e16a 100644 --- a/include/linux/thread_info.h +++ b/include/linux/thread_info.h @@ -56,10 +56,10 @@ extern long do_no_restart_syscall(struct restart_block *parm); #ifdef __KERNEL__ #ifdef CONFIG_DEBUG_STACK_USAGE -# define THREADINFO_GFP (GFP_KERNEL_ACCOUNT | __GFP_NOTRACK | \ +# define THREADINFO_GFP (GFP_KERNEL_ACCOUNT | __GFP_NOTRACK | ___GFP_TOI_NOTRACK | \ __GFP_ZERO) #else -# define THREADINFO_GFP (GFP_KERNEL_ACCOUNT | __GFP_NOTRACK) +# define THREADINFO_GFP (GFP_KERNEL_ACCOUNT | __GFP_NOTRACK | ___GFP_TOI_NOTRACK) #endif /* diff --git b/include/linux/tuxonice.h b/include/linux/tuxonice.h new file mode 100644 index 0000000..67b05a7 --- /dev/null +++ b/include/linux/tuxonice.h @@ -0,0 +1,48 @@ +/* + * include/linux/tuxonice.h + * + * Copyright (C) 2015 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + */ + +#ifndef INCLUDE_LINUX_TUXONICE_H +#define INCLUDE_LINUX_TUXONICE_H +#ifdef CONFIG_TOI_INCREMENTAL +extern void toi_set_logbuf_untracked(void); +extern int toi_make_writable(pgd_t *pgd, unsigned long address); + +static inline int toi_incremental_support(void) +{ + return 1; +} + +/* Copy Before Write */ +struct toi_cbw { + unsigned long pfn; + void *virt; + struct toi_cbw *next; +}; + +struct toi_cbw_state { + bool active; /* Is a fault handler running? */ + bool enabled; /* Are we doing copy before write? */ + int size; /* The number of pages allocated */ + struct toi_cbw *first, *next, *last; /* Pointers to the data structure */ +}; + +#define CBWS_PER_PAGE (PAGE_SIZE / sizeof(struct toi_cbw)) +DECLARE_PER_CPU(struct toi_cbw_state, toi_cbw_states); +#else +#define toi_set_logbuf_untracked() do { } while(0) +static inline int toi_make_writable(pgd_t *pgd, unsigned long addr) +{ + return 0; +} + +static inline int toi_incremental_support(void) +{ + return 0; +} +#endif +#endif diff --git a/include/linux/wait.h b/include/linux/wait.h index ae71a76..27d7a0a 100644 --- a/include/linux/wait.h +++ b/include/linux/wait.h @@ -338,7 +338,7 @@ do { \ schedule(); try_to_freeze()) /** - * wait_event - sleep (or freeze) until a condition gets true + * wait_event_freezable - sleep (or freeze) until a condition gets true * @wq: the waitqueue to wait on * @condition: a C expression for the event to wait for * diff --git a/include/net/secure_seq.h b/include/net/secure_seq.h index 3f36d45..ec392a1 100644 --- a/include/net/secure_seq.h +++ b/include/net/secure_seq.h @@ -14,5 +14,10 @@ u64 secure_dccp_sequence_number(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport); u64 secure_dccpv6_sequence_number(__be32 *saddr, __be32 *daddr, __be16 sport, __be16 dport); +#ifdef CONFIG_TCP_STEALTH +u32 tcp_stealth_do_auth(struct sock *sk, struct sk_buff *skb); +u32 tcp_stealth_sequence_number(struct sock *sk, __be32 *daddr, + u32 daddr_size, __be16 dport); +#endif #endif /* _NET_SECURE_SEQ */ diff --git a/include/net/tcp.h b/include/net/tcp.h index ae6468f..7b9c0ff 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -440,6 +440,12 @@ void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *opt_rx, int estab, struct tcp_fastopen_cookie *foc); const u8 *tcp_parse_md5sig_option(const struct tcphdr *th); +#ifdef CONFIG_TCP_STEALTH +const bool tcp_parse_tsval_option(u32 *tsval, const struct tcphdr *th); +int tcp_stealth_integrity(u16 *hash, u8 *secret, u8 *payload, int len); +#define be32_isn_to_be16_av(x) (((__be16 *)&x)[0]) +#define be32_isn_to_be16_ih(x) (((__be16 *)&x)[1]) +#endif /* * TCP v4 functions exported for the inet6 API diff --git b/include/trace/events/fs.h b/include/trace/events/fs.h new file mode 100644 index 0000000..fb634b7 --- /dev/null +++ b/include/trace/events/fs.h @@ -0,0 +1,53 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM fs + +#if !defined(_TRACE_FS_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_FS_H + +#include +#include + +TRACE_EVENT(do_sys_open, + + TP_PROTO(const char *filename, int flags, int mode), + + TP_ARGS(filename, flags, mode), + + TP_STRUCT__entry( + __string( filename, filename ) + __field( int, flags ) + __field( int, mode ) + ), + + TP_fast_assign( + __assign_str(filename, filename); + __entry->flags = flags; + __entry->mode = mode; + ), + + TP_printk("\"%s\" %x %o", + __get_str(filename), __entry->flags, __entry->mode) +); + +TRACE_EVENT(open_exec, + + TP_PROTO(const char *filename), + + TP_ARGS(filename), + + TP_STRUCT__entry( + __string( filename, filename ) + ), + + TP_fast_assign( + __assign_str(filename, filename); + ), + + TP_printk("\"%s\"", + __get_str(filename)) +); + +#endif /* _TRACE_FS_H */ + +/* This part must be outside protection */ +#include diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h index f095155..f95de72 100644 --- a/include/uapi/linux/netlink.h +++ b/include/uapi/linux/netlink.h @@ -27,6 +27,8 @@ #define NETLINK_ECRYPTFS 19 #define NETLINK_RDMA 20 #define NETLINK_CRYPTO 21 /* Crypto layer */ +#define NETLINK_TOI_USERUI 22 /* TuxOnIce's userui */ +#define NETLINK_TOI_USM 23 /* Userspace storage manager */ #define NETLINK_INET_DIAG NETLINK_SOCK_DIAG diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index 65a77b0..9a0dfc2 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -115,6 +115,9 @@ enum { #define TCP_CC_INFO 26 /* Get Congestion Control (optional) info */ #define TCP_SAVE_SYN 27 /* Record SYN headers for new connections */ #define TCP_SAVED_SYN 28 /* Get SYN headers recorded for connection */ +#define TCP_STEALTH 29 +#define TCP_STEALTH_INTEGRITY 30 +#define TCP_STEALTH_INTEGRITY_LEN 31 struct tcp_repair_opt { __u32 opt_code; diff --git a/include/uapi/linux/vt.h b/include/uapi/linux/vt.h index 978578b..6d2e54e 100644 --- a/include/uapi/linux/vt.h +++ b/include/uapi/linux/vt.h @@ -3,12 +3,26 @@ /* + * We will make this definition solely for the purpose of making packages + * such as splashutils build, because they can not understand that + * NR_TTY_DEVICES is defined in the kernel configuration. + */ +#ifndef CONFIG_NR_TTY_DEVICES +#define CONFIG_NR_TTY_DEVICES 63 +#endif + +/* * These constants are also useful for user-level apps (e.g., VC * resizing). */ #define MIN_NR_CONSOLES 1 /* must be at least 1 */ -#define MAX_NR_CONSOLES 63 /* serial lines start at 64 */ -#define MAX_NR_USER_CONSOLES 63 /* must be root to allocate above this */ + +/* + * NR_TTY_DEVICES: + * Value MUST be at least 11 and must never be higher then 63 + */ +#define MAX_NR_CONSOLES CONFIG_NR_TTY_DEVICES /* serial lines start above this */ +#define MAX_NR_USER_CONSOLES CONFIG_NR_TTY_DEVICES /* must be root to allocate above this */ /* Note: the ioctl VT_GETSTATE does not work for consoles 16 and higher (since it returns a short) */ diff --git a/init/Kconfig b/init/Kconfig index 2232080..ec251fa 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -28,6 +28,34 @@ config BUILDTIME_EXTABLE_SORT menu "General setup" +config PCK_INTERACTIVE + bool "Tune kernel for interactivity" + default y + help + Tunes the kernel for responsiveness at the cost of throughput and power usage. + + --- Virtual Memory Subsystem --------------------------- + + Mem dirty before bg writeback..: 10 % -> 20 % + Mem dirty before sync writeback: 20 % -> 50 % + + --- Block Layer ---------------------------------------- + + NCQ Queue Depth................: 31 -> 8 + Block Layer Queue Depth........: 128 -> 16 + + --- CPU Scheduler -------------------------------------- + + Scheduling latency.............: 6 -> 3 ms + Minimal granularity............: 0.75 -> 0.3 ms + Wakeup granularity.............: 1 -> 0.5 ms + CPU migration cost.............: 0.5 -> 0.25 ms + Bandwidth slice size...........: 5 -> 3 ms + + --- CPU Frequency Scaling ------------------------------ + + Ondemand down scaling factor...: 1 -> 10 + config BROKEN bool @@ -1283,13 +1311,36 @@ source "usr/Kconfig" endif +choice + prompt "Code optimization level" + default CC_OPTIMIZE_DEFAULT + help + Select the optimization flag to pass to the compiler, + affecting kernel size, speed and compilation time. + + If in doubt, choose the default optimization level. + +config CC_OPTIMIZE_DEFAULT + bool "Default optimization" + help + This option will pass "-O2" to your compiler. This is + the recommended level and the most tested. + +config CC_OPTIMIZE_HARDER + bool "Optimize harder" + help + This option will pass "-O3" to your compiler resulting + in a larger and faster kernel. The more complex + optimizations also increase compilation time and may + affect stability. + config CC_OPTIMIZE_FOR_SIZE bool "Optimize for size" help - Enabling this option will pass "-Os" instead of "-O2" to - your compiler resulting in a smaller kernel. + This option will pass "-Os" to your compiler resulting + in a smaller kernel. - If unsure, say N. +endchoice config SYSCTL bool diff --git a/init/do_mounts.c b/init/do_mounts.c index dea5de9..1fed726 100644 --- a/init/do_mounts.c +++ b/init/do_mounts.c @@ -597,6 +597,8 @@ void __init prepare_namespace(void) if (is_floppy && rd_doload && rd_load_disk(0)) ROOT_DEV = Root_RAM0; + check_resume_attempted(); + mount_root(); out: devtmpfs_mount("dev"); diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c index a1000ca..56b5a0c 100644 --- a/init/do_mounts_initrd.c +++ b/init/do_mounts_initrd.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -79,6 +80,11 @@ static void __init handle_initrd(void) current->flags &= ~PF_FREEZER_SKIP; + if (!resume_attempted) + printk(KERN_ERR "TuxOnIce: No attempt was made to resume from " + "any image that might exist.\n"); + clear_toi_state(TOI_BOOT_TIME); + /* move initrd to rootfs' /old */ sys_mount("..", ".", NULL, MS_MOVE, NULL); /* switch root and cwd back to / of rootfs */ diff --git a/ipc/shm.c b/ipc/shm.c index 331fc1b..12934ca 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -578,7 +578,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params) if ((shmflg & SHM_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER) acctflag = VM_NORESERVE; - file = shmem_kernel_file_setup(name, size, acctflag); + file = shmem_kernel_file_setup(name, size, acctflag, 0); } error = PTR_ERR(file); if (IS_ERR(file)) diff --git a/kernel/debug/kdb/kdb_io.c b/kernel/debug/kdb/kdb_io.c index fc1ef73..6697a3d 100644 --- a/kernel/debug/kdb/kdb_io.c +++ b/kernel/debug/kdb/kdb_io.c @@ -710,7 +710,7 @@ kdb_printit: } } while (c) { - c->write(c, cp, retlen - (cp - kdb_buffer)); + c->write(c, cp, retlen - (cp - kdb_buffer), 7); /* 7 == KERN_DEBUG */ touch_nmi_watchdog(); c = c->next; } @@ -774,7 +774,7 @@ kdb_printit: } } while (c) { - c->write(c, moreprompt, strlen(moreprompt)); + c->write(c, moreprompt, strlen(moreprompt), 7); /* 7 == KERN_DEBUG */ touch_nmi_watchdog(); c = c->next; } diff --git a/kernel/fork.c b/kernel/fork.c index 2e391c7..bb061d0 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -138,7 +138,7 @@ static struct kmem_cache *task_struct_cachep; static inline struct task_struct *alloc_task_struct_node(int node) { - return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL, node); + return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL | ___GFP_TOI_NOTRACK, node); } static inline void free_task_struct(struct task_struct *tsk) diff --git a/kernel/kthread.c b/kernel/kthread.c index 9ff173d..12d8a8f 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -275,7 +275,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data), DECLARE_COMPLETION_ONSTACK(done); struct task_struct *task; struct kthread_create_info *create = kmalloc(sizeof(*create), - GFP_KERNEL); + GFP_KERNEL | ___GFP_TOI_NOTRACK); if (!create) return ERR_PTR(-ENOMEM); diff --git a/kernel/latencytop.c b/kernel/latencytop.c index a028127..b5c30d9 100644 --- a/kernel/latencytop.c +++ b/kernel/latencytop.c @@ -47,12 +47,12 @@ * of times) */ -#include #include #include #include #include #include +#include #include #include #include @@ -289,4 +289,16 @@ static int __init init_lstats_procfs(void) proc_create("latency_stats", 0644, NULL, &lstats_fops); return 0; } + +int sysctl_latencytop(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + int err; + + err = proc_dointvec(table, write, buffer, lenp, ppos); + if (latencytop_enabled) + force_schedstat_enabled(); + + return err; +} device_initcall(init_lstats_procfs); diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig index 68d3ebc..5f93a3c 100644 --- a/kernel/power/Kconfig +++ b/kernel/power/Kconfig @@ -101,6 +101,284 @@ config PM_STD_PARTITION suspended image to. It will simply pick the first available swap device. +menuconfig TOI_CORE + bool "Enhanced Hibernation (TuxOnIce)" + depends on HIBERNATION + default y + ---help--- + TuxOnIce is the 'new and improved' suspend support. + + See the TuxOnIce home page (tuxonice.net) + for FAQs, HOWTOs and other documentation. + + comment "Image Storage (you need at least one allocator)" + depends on TOI_CORE + + config TOI_FILE + bool "File Allocator" + depends on TOI_CORE + default y + ---help--- + This option enables support for storing an image in a + simple file. You might want this if your swap is + sometimes full enough that you don't have enough spare + space to store an image. + + config TOI_SWAP + bool "Swap Allocator" + depends on TOI_CORE && SWAP + default y + ---help--- + This option enables support for storing an image in your + swap space. + + comment "General Options" + depends on TOI_CORE + + config TOI_PRUNE + bool "Image pruning support" + depends on TOI_CORE && CRYPTO && BROKEN + default y + ---help--- + This option adds support for using cryptoapi hashing + algorithms to identify pages with the same content. We + then write a much smaller pointer to the first copy of + the data instead of a complete (perhaps compressed) + additional copy. + + You probably want this, so say Y here. + + comment "No image pruning support available without Cryptoapi support." + depends on TOI_CORE && !CRYPTO + + config TOI_CRYPTO + bool "Compression support" + depends on TOI_CORE && CRYPTO + default y + ---help--- + This option adds support for using cryptoapi compression + algorithms. Compression is particularly useful as it can + more than double your suspend and resume speed (depending + upon how well your image compresses). + + You probably want this, so say Y here. + + comment "No compression support available without Cryptoapi support." + depends on TOI_CORE && !CRYPTO + + config TOI_USERUI + bool "Userspace User Interface support" + depends on TOI_CORE && NET && (VT || SERIAL_CONSOLE) + default y + ---help--- + This option enabled support for a userspace based user interface + to TuxOnIce, which allows you to have a nice display while suspending + and resuming, and also enables features such as pressing escape to + cancel a cycle or interactive debugging. + + config TOI_USERUI_DEFAULT_PATH + string "Default userui program location" + default "/usr/local/sbin/tuxoniceui_text" + depends on TOI_USERUI + ---help--- + This entry allows you to specify a default path to the userui binary. + + config TOI_DEFAULT_IMAGE_SIZE_LIMIT + int "Default image size limit" + range -2 65536 + default "-2" + depends on TOI_CORE + ---help--- + This entry allows you to specify a default image size limit. It can + be overridden at run-time using /sys/power/tuxonice/image_size_limit. + + config TOI_KEEP_IMAGE + bool "Allow Keep Image Mode" + depends on TOI_CORE + ---help--- + This option allows you to keep and image and reuse it. It is intended + __ONLY__ for use with systems where all filesystems are mounted read- + only (kiosks, for example). To use it, compile this option in and boot + normally. Set the KEEP_IMAGE flag in /sys/power/tuxonice and suspend. + When you resume, the image will not be removed. You will be unable to turn + off swap partitions (assuming you are using the swap allocator), but future + suspends simply do a power-down. The image can be updated using the + kernel command line parameter suspend_act= to turn off the keep image + bit. Keep image mode is a little less user friendly on purpose - it + should not be used without thought! + + config TOI_INCREMENTAL + bool "Incremental Image Support" + depends on TOI_CORE && 64BIT && TOI_KEEP_IMAGE + default n + ---help--- + This option enables the work in progress toward using the dirty page + tracking to record changes to pages. It is hoped that + this will be an initial step toward implementing storing just + the differences between consecutive images, which will + increase the amount of storage needed for the image, but also + increase the speed at which writing an image occurs and + reduce the wear and tear on drives. + + At the moment, all that is implemented is the first step of keeping + an existing image and then comparing it to the contents in memory + (by setting /sys/power/tuxonice/verify_image to 1 and triggering a + (fake) resume) to see what the page change tracking should find to be + different. If you have verify_image set to 1, TuxOnIce will automatically + invalidate the old image when you next try to hibernate, so there's no + greater chance of disk corruption than normal. + + comment "No incremental image support available without Keep Image support." + depends on TOI_CORE && !TOI_KEEP_IMAGE && 64BIT + + config TOI_REPLACE_SWSUSP + bool "Replace swsusp by default" + default y + depends on TOI_CORE + ---help--- + TuxOnIce can replace swsusp. This option makes that the default state, + requiring you to echo 0 > /sys/power/tuxonice/replace_swsusp if you want + to use the vanilla kernel functionality. Note that your initrd/ramfs will + need to do this before trying to resume, too. + With overriding swsusp enabled, echoing disk to /sys/power/state will + start a TuxOnIce cycle. If resume= doesn't specify an allocator and both + the swap and file allocators are compiled in, the swap allocator will be + used by default. + + config TOI_IGNORE_LATE_INITCALL + bool "Wait for initrd/ramfs to run, by default" + default n + depends on TOI_CORE + ---help--- + When booting, TuxOnIce can check for an image and start to resume prior + to any initrd/ramfs running (via a late initcall). + + If you don't have an initrd/ramfs, this is what you want to happen - + otherwise you won't be able to safely resume. You should set this option + to 'No'. + + If, however, you want your initrd/ramfs to run anyway before resuming, + you need to tell TuxOnIce to ignore that earlier opportunity to resume. + This can be done either by using this compile time option, or by + overriding this option with the boot-time parameter toi_initramfs_resume_only=1. + + Note that if TuxOnIce can't resume at the earlier opportunity, the + value of this option won't matter - the initramfs/initrd (if any) will + run anyway. + + menuconfig TOI_CLUSTER + bool "Cluster support" + default n + depends on TOI_CORE && NET && BROKEN + ---help--- + Support for linking multiple machines in a cluster so that they suspend + and resume together. + + config TOI_DEFAULT_CLUSTER_INTERFACE + string "Default cluster interface" + depends on TOI_CLUSTER + ---help--- + The default interface on which to communicate with other nodes in + the cluster. + + If no value is set here, cluster support will be disabled by default. + + config TOI_DEFAULT_CLUSTER_KEY + string "Default cluster key" + default "Default" + depends on TOI_CLUSTER + ---help--- + The default key used by this node. All nodes in the same cluster + have the same key. Multiple clusters may coexist on the same lan + by using different values for this key. + + config TOI_CLUSTER_IMAGE_TIMEOUT + int "Timeout when checking for image" + default 15 + depends on TOI_CLUSTER + ---help--- + Timeout (seconds) before continuing to boot when waiting to see + whether other nodes might have an image. Set to -1 to wait + indefinitely. In WAIT_UNTIL_NODES is non zero, we might continue + booting sooner than this timeout. + + config TOI_CLUSTER_WAIT_UNTIL_NODES + int "Nodes without image before continuing" + default 0 + depends on TOI_CLUSTER + ---help--- + When booting and no image is found, we wait to see if other nodes + have an image before continuing to boot. This value lets us + continue after seeing a certain number of nodes without an image, + instead of continuing to wait for the timeout. Set to 0 to only + use the timeout. + + config TOI_DEFAULT_CLUSTER_PRE_HIBERNATE + string "Default pre-hibernate script" + depends on TOI_CLUSTER + ---help--- + The default script to be called when starting to hibernate. + + config TOI_DEFAULT_CLUSTER_POST_HIBERNATE + string "Default post-hibernate script" + depends on TOI_CLUSTER + ---help--- + The default script to be called after resuming from hibernation. + + config TOI_DEFAULT_WAIT + int "Default waiting time for emergency boot messages" + default "25" + range -1 32768 + depends on TOI_CORE + help + TuxOnIce can display warnings very early in the process of resuming, + if (for example) it appears that you have booted a kernel that doesn't + match an image on disk. It can then give you the opportunity to either + continue booting that kernel, or reboot the machine. This option can be + used to control how long to wait in such circumstances. -1 means wait + forever. 0 means don't wait at all (do the default action, which will + generally be to continue booting and remove the image). Values of 1 or + more indicate a number of seconds (up to 255) to wait before doing the + default. + + config TOI_DEFAULT_EXTRA_PAGES_ALLOWANCE + int "Default extra pages allowance" + default "2000" + range 500 32768 + depends on TOI_CORE + help + This value controls the default for the allowance TuxOnIce makes for + drivers to allocate extra memory during the atomic copy. The default + value of 2000 will be okay in most cases. If you are using + DRI, the easiest way to find what value to use is to try to hibernate + and look at how many pages were actually needed in the sysfs entry + /sys/power/tuxonice/debug_info (first number on the last line), adding + a little extra because the value is not always the same. + + config TOI_CHECKSUM + bool "Checksum pageset2" + default n + depends on TOI_CORE + select CRYPTO + select CRYPTO_ALGAPI + select CRYPTO_MD4 + ---help--- + Adds support for checksumming pageset2 pages, to ensure you really get an + atomic copy. Since some filesystems (XFS especially) change metadata even + when there's no other activity, we need this to check for pages that have + been changed while we were saving the page cache. If your debugging output + always says no pages were resaved, you may be able to safely disable this + option. + +config TOI + bool + depends on TOI_CORE!=n + default y + +config TOI_ZRAM_SUPPORT + def_bool y + depends on TOI && ZRAM!=n + config PM_SLEEP def_bool y depends on SUSPEND || HIBERNATE_CALLBACKS diff --git a/kernel/power/Makefile b/kernel/power/Makefile index cb880a1..82c4795 100644 --- a/kernel/power/Makefile +++ b/kernel/power/Makefile @@ -1,6 +1,38 @@ ccflags-$(CONFIG_PM_DEBUG) := -DDEBUG +tuxonice_core-y := tuxonice_modules.o + +obj-$(CONFIG_TOI) += tuxonice_builtin.o +obj-$(CONFIG_TOI_INCREMENTAL) += tuxonice_incremental.o \ + tuxonice_copy_before_write.o + +tuxonice_core-$(CONFIG_PM_DEBUG) += tuxonice_alloc.o + +# Compile these in after allocation debugging, if used. + +tuxonice_core-y += tuxonice_sysfs.o tuxonice_highlevel.o \ + tuxonice_io.o tuxonice_pagedir.o tuxonice_prepare_image.o \ + tuxonice_extent.o tuxonice_pageflags.o tuxonice_ui.o \ + tuxonice_power_off.o tuxonice_atomic_copy.o + +tuxonice_core-$(CONFIG_TOI_CHECKSUM) += tuxonice_checksum.o + +tuxonice_core-$(CONFIG_NET) += tuxonice_storage.o tuxonice_netlink.o + +obj-$(CONFIG_TOI_CORE) += tuxonice_core.o +obj-$(CONFIG_TOI_PRUNE) += tuxonice_prune.o +obj-$(CONFIG_TOI_CRYPTO) += tuxonice_compress.o + +tuxonice_bio-y := tuxonice_bio_core.o tuxonice_bio_chains.o \ + tuxonice_bio_signature.o + +obj-$(CONFIG_TOI_SWAP) += tuxonice_bio.o tuxonice_swap.o +obj-$(CONFIG_TOI_FILE) += tuxonice_bio.o tuxonice_file.o +obj-$(CONFIG_TOI_CLUSTER) += tuxonice_cluster.o + +obj-$(CONFIG_TOI_USERUI) += tuxonice_userui.o + obj-y += qos.o obj-$(CONFIG_PM) += main.o obj-$(CONFIG_VT_CONSOLE_SLEEP) += console.o diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c index b7342a2..153e51d 100644 --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -31,7 +31,7 @@ #include #include -#include "power.h" +#include "tuxonice.h" static int nocompress; @@ -39,7 +39,7 @@ static int noresume; static int nohibernate; static int resume_wait; static unsigned int resume_delay; -static char resume_file[256] = CONFIG_PM_STD_PARTITION; +char resume_file[256] = CONFIG_PM_STD_PARTITION; dev_t swsusp_resume_device; sector_t swsusp_resume_block; __visible int in_suspend __nosavedata; @@ -123,7 +123,7 @@ static int hibernation_test(int level) { return 0; } * platform_begin - Call platform to start hibernation. * @platform_mode: Whether or not to use the platform driver. */ -static int platform_begin(int platform_mode) +int platform_begin(int platform_mode) { return (platform_mode && hibernation_ops) ? hibernation_ops->begin() : 0; @@ -133,7 +133,7 @@ static int platform_begin(int platform_mode) * platform_end - Call platform to finish transition to the working state. * @platform_mode: Whether or not to use the platform driver. */ -static void platform_end(int platform_mode) +void platform_end(int platform_mode) { if (platform_mode && hibernation_ops) hibernation_ops->end(); @@ -147,7 +147,7 @@ static void platform_end(int platform_mode) * if so configured, and return an error code if that fails. */ -static int platform_pre_snapshot(int platform_mode) +int platform_pre_snapshot(int platform_mode) { return (platform_mode && hibernation_ops) ? hibernation_ops->pre_snapshot() : 0; @@ -162,7 +162,7 @@ static int platform_pre_snapshot(int platform_mode) * * This routine is called on one CPU with interrupts disabled. */ -static void platform_leave(int platform_mode) +void platform_leave(int platform_mode) { if (platform_mode && hibernation_ops) hibernation_ops->leave(); @@ -177,7 +177,7 @@ static void platform_leave(int platform_mode) * * This routine must be called after platform_prepare(). */ -static void platform_finish(int platform_mode) +void platform_finish(int platform_mode) { if (platform_mode && hibernation_ops) hibernation_ops->finish(); @@ -193,7 +193,7 @@ static void platform_finish(int platform_mode) * If the restore fails after this function has been called, * platform_restore_cleanup() must be called. */ -static int platform_pre_restore(int platform_mode) +int platform_pre_restore(int platform_mode) { return (platform_mode && hibernation_ops) ? hibernation_ops->pre_restore() : 0; @@ -210,7 +210,7 @@ static int platform_pre_restore(int platform_mode) * function must be called too, regardless of the result of * platform_pre_restore(). */ -static void platform_restore_cleanup(int platform_mode) +void platform_restore_cleanup(int platform_mode) { if (platform_mode && hibernation_ops) hibernation_ops->restore_cleanup(); @@ -220,7 +220,7 @@ static void platform_restore_cleanup(int platform_mode) * platform_recover - Recover from a failure to suspend devices. * @platform_mode: Whether or not to use the platform driver. */ -static void platform_recover(int platform_mode) +void platform_recover(int platform_mode) { if (platform_mode && hibernation_ops && hibernation_ops->recover) hibernation_ops->recover(); @@ -648,6 +648,9 @@ int hibernate(void) { int error; + if (test_action_state(TOI_REPLACE_SWSUSP)) + return try_tuxonice_hibernate(); + if (!hibernation_available()) { pr_debug("PM: Hibernation not available.\n"); return -EPERM; @@ -737,11 +740,19 @@ int hibernate(void) * attempts to recover gracefully and make the kernel return to the normal mode * of operation. */ -static int software_resume(void) +int software_resume(void) { int error; unsigned int flags; + resume_attempted = 1; + + /* + * We can't know (until an image header - if any - is loaded), whether + * we did override swsusp. We therefore ensure that both are tried. + */ + try_tuxonice_resume(); + /* * If the user said "noresume".. bail out early. */ @@ -1128,6 +1139,7 @@ static int __init hibernate_setup(char *str) static int __init noresume_setup(char *str) { noresume = 1; + set_toi_state(TOI_NORESUME_SPECIFIED); return 1; } diff --git a/kernel/power/power.h b/kernel/power/power.h index efe1b3b..2577c6d 100644 --- a/kernel/power/power.h +++ b/kernel/power/power.h @@ -36,8 +36,12 @@ static inline char *check_image_kernel(struct swsusp_info *info) return arch_hibernation_header_restore(info) ? "architecture specific data" : NULL; } +#else +extern char *check_image_kernel(struct swsusp_info *info); #endif /* CONFIG_ARCH_HIBERNATION_HEADER */ +extern int init_header(struct swsusp_info *info); +extern char resume_file[256]; /* * Keep some memory free so that I/O operations can succeed without paging * [Might this be more than 4 MB?] @@ -86,6 +90,8 @@ static struct kobj_attribute _name##_attr = { \ .show = _name##_show, \ } +extern struct pbe *restore_pblist; + /* Preferred image size in bytes (default 500 MB) */ extern unsigned long image_size; /* Size of memory reserved for drivers (default SPARE_PAGES x PAGE_SIZE) */ @@ -269,6 +275,31 @@ static inline void suspend_thaw_processes(void) } #endif +extern struct page *saveable_page(struct zone *z, unsigned long p); +#ifdef CONFIG_HIGHMEM +struct page *saveable_highmem_page(struct zone *z, unsigned long p); +#else +static +inline void *saveable_highmem_page(struct zone *z, unsigned long p) +{ + return NULL; +} +#endif + +#define PBES_PER_PAGE (PAGE_SIZE / sizeof(struct pbe)) +extern struct list_head nosave_regions; + +/** + * This structure represents a range of page frames the contents of which + * should not be saved during the suspend. + */ + +struct nosave_region { + struct list_head list; + unsigned long start_pfn; + unsigned long end_pfn; +}; + #ifdef CONFIG_PM_AUTOSLEEP /* kernel/power/autosleep.c */ @@ -295,3 +326,10 @@ extern int pm_wake_lock(const char *buf); extern int pm_wake_unlock(const char *buf); #endif /* !CONFIG_PM_WAKELOCKS */ + +#ifdef CONFIG_TOI +unsigned long toi_get_nonconflicting_page(void); +#define BM_END_OF_MAP (~0UL) +#else +#define toi_get_nonconflicting_page() (0) +#endif diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 3a97060..542163a 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -36,6 +36,9 @@ #include #include +#include "tuxonice_modules.h" +#include "tuxonice_builtin.h" +#include "tuxonice_alloc.h" #include "power.h" static int swsusp_page_is_free(struct page *); @@ -98,6 +101,9 @@ static void *get_image_page(gfp_t gfp_mask, int safe_needed) { void *res; + if (toi_running) + return (void *) toi_get_nonconflicting_page(); + res = (void *)get_zeroed_page(gfp_mask); if (safe_needed) while (res && swsusp_page_is_free(virt_to_page(res))) { @@ -143,6 +149,11 @@ static inline void free_image_page(void *addr, int clear_nosave_free) page = virt_to_page(addr); + if (toi_running) { + toi__free_page(29, page); + return; + } + swsusp_unset_page_forbidden(page); if (clear_nosave_free) swsusp_unset_page_free(page); @@ -302,13 +313,15 @@ struct bm_position { int node_bit; }; +#define BM_POSITION_SLOTS (NR_CPUS * 2) + struct memory_bitmap { struct list_head zones; struct linked_page *p_list; /* list of pages used to store zone * bitmap objects and bitmap block * objects */ - struct bm_position cur; /* most recently used bit position */ + struct bm_position cur[BM_POSITION_SLOTS]; /* most recently used bit position */ }; /* Functions that operate on memory bitmaps */ @@ -473,16 +486,39 @@ static void free_zone_bm_rtree(struct mem_zone_bm_rtree *zone, free_image_page(node->data, clear_nosave_free); } -static void memory_bm_position_reset(struct memory_bitmap *bm) +void memory_bm_position_reset(struct memory_bitmap *bm) { - bm->cur.zone = list_entry(bm->zones.next, struct mem_zone_bm_rtree, + int index; + + for (index = 0; index < BM_POSITION_SLOTS; index++) { + bm->cur[index].zone = list_entry(bm->zones.next, struct mem_zone_bm_rtree, list); - bm->cur.node = list_entry(bm->cur.zone->leaves.next, + bm->cur[index].node = list_entry(bm->cur[index].zone->leaves.next, struct rtree_node, list); - bm->cur.node_pfn = 0; - bm->cur.node_bit = 0; + bm->cur[index].node_pfn = 0; + bm->cur[index].node_bit = 0; + } } +static void memory_bm_clear_current(struct memory_bitmap *bm, int index); +unsigned long memory_bm_next_pfn(struct memory_bitmap *bm, int index); + +/** + * memory_bm_clear + * @param bm - The bitmap to clear + * + * Only run while single threaded - locking not needed + */ +void memory_bm_clear(struct memory_bitmap *bm) +{ + memory_bm_position_reset(bm); + + while (memory_bm_next_pfn(bm, 0) != BM_END_OF_MAP) { + memory_bm_clear_current(bm, 0); + } + + memory_bm_position_reset(bm); +} static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free); struct mem_extent { @@ -595,7 +631,8 @@ memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed) } bm->p_list = ca.chain; - memory_bm_position_reset(bm); + + memory_bm_position_reset(bm); Exit: free_mem_extents(&mem_extents); return error; @@ -631,14 +668,24 @@ static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free) * It walks the radix tree to find the page which contains the bit for * pfn and returns the bit position in **addr and *bit_nr. */ -static int memory_bm_find_bit(struct memory_bitmap *bm, unsigned long pfn, - void **addr, unsigned int *bit_nr) +int memory_bm_find_bit(struct memory_bitmap *bm, int index, + unsigned long pfn, void **addr, unsigned int *bit_nr) { struct mem_zone_bm_rtree *curr, *zone; struct rtree_node *node; int i, block_nr; - zone = bm->cur.zone; + if (!bm->cur[index].zone) { + // Reset + bm->cur[index].zone = list_entry(bm->zones.next, struct mem_zone_bm_rtree, + list); + bm->cur[index].node = list_entry(bm->cur[index].zone->leaves.next, + struct rtree_node, list); + bm->cur[index].node_pfn = 0; + bm->cur[index].node_bit = 0; + } + + zone = bm->cur[index].zone; if (pfn >= zone->start_pfn && pfn < zone->end_pfn) goto zone_found; @@ -662,8 +709,8 @@ zone_found: * node for our pfn. */ - node = bm->cur.node; - if (((pfn - zone->start_pfn) & ~BM_BLOCK_MASK) == bm->cur.node_pfn) + node = bm->cur[index].node; + if (((pfn - zone->start_pfn) & ~BM_BLOCK_MASK) == bm->cur[index].node_pfn) goto node_found; node = zone->rtree; @@ -680,9 +727,9 @@ zone_found: node_found: /* Update last position */ - bm->cur.zone = zone; - bm->cur.node = node; - bm->cur.node_pfn = (pfn - zone->start_pfn) & ~BM_BLOCK_MASK; + bm->cur[index].zone = zone; + bm->cur[index].node = node; + bm->cur[index].node_pfn = (pfn - zone->start_pfn) & ~BM_BLOCK_MASK; /* Set return values */ *addr = node->data; @@ -691,66 +738,66 @@ node_found: return 0; } -static void memory_bm_set_bit(struct memory_bitmap *bm, unsigned long pfn) +void memory_bm_set_bit(struct memory_bitmap *bm, int index, unsigned long pfn) { void *addr; unsigned int bit; int error; - error = memory_bm_find_bit(bm, pfn, &addr, &bit); + error = memory_bm_find_bit(bm, index, pfn, &addr, &bit); BUG_ON(error); set_bit(bit, addr); } -static int mem_bm_set_bit_check(struct memory_bitmap *bm, unsigned long pfn) +int mem_bm_set_bit_check(struct memory_bitmap *bm, int index, unsigned long pfn) { void *addr; unsigned int bit; int error; - error = memory_bm_find_bit(bm, pfn, &addr, &bit); + error = memory_bm_find_bit(bm, index, pfn, &addr, &bit); if (!error) set_bit(bit, addr); return error; } -static void memory_bm_clear_bit(struct memory_bitmap *bm, unsigned long pfn) +void memory_bm_clear_bit(struct memory_bitmap *bm, int index, unsigned long pfn) { void *addr; unsigned int bit; int error; - error = memory_bm_find_bit(bm, pfn, &addr, &bit); + error = memory_bm_find_bit(bm, index, pfn, &addr, &bit); BUG_ON(error); clear_bit(bit, addr); } -static void memory_bm_clear_current(struct memory_bitmap *bm) +static void memory_bm_clear_current(struct memory_bitmap *bm, int index) { int bit; - bit = max(bm->cur.node_bit - 1, 0); - clear_bit(bit, bm->cur.node->data); + bit = max(bm->cur[index].node_bit - 1, 0); + clear_bit(bit, bm->cur[index].node->data); } -static int memory_bm_test_bit(struct memory_bitmap *bm, unsigned long pfn) +int memory_bm_test_bit(struct memory_bitmap *bm, int index, unsigned long pfn) { void *addr; unsigned int bit; int error; - error = memory_bm_find_bit(bm, pfn, &addr, &bit); + error = memory_bm_find_bit(bm, index, pfn, &addr, &bit); BUG_ON(error); return test_bit(bit, addr); } -static bool memory_bm_pfn_present(struct memory_bitmap *bm, unsigned long pfn) +static bool memory_bm_pfn_present(struct memory_bitmap *bm, int index, unsigned long pfn) { void *addr; unsigned int bit; - return !memory_bm_find_bit(bm, pfn, &addr, &bit); + return !memory_bm_find_bit(bm, index, pfn, &addr, &bit); } /* @@ -763,25 +810,25 @@ static bool memory_bm_pfn_present(struct memory_bitmap *bm, unsigned long pfn) * * Returns true if there is a next node, false otherwise. */ -static bool rtree_next_node(struct memory_bitmap *bm) +static bool rtree_next_node(struct memory_bitmap *bm, int index) { - bm->cur.node = list_entry(bm->cur.node->list.next, + bm->cur[index].node = list_entry(bm->cur[index].node->list.next, struct rtree_node, list); - if (&bm->cur.node->list != &bm->cur.zone->leaves) { - bm->cur.node_pfn += BM_BITS_PER_BLOCK; - bm->cur.node_bit = 0; + if (&bm->cur[index].node->list != &bm->cur[index].zone->leaves) { + bm->cur[index].node_pfn += BM_BITS_PER_BLOCK; + bm->cur[index].node_bit = 0; touch_softlockup_watchdog(); return true; } /* No more nodes, goto next zone */ - bm->cur.zone = list_entry(bm->cur.zone->list.next, + bm->cur[index].zone = list_entry(bm->cur[index].zone->list.next, struct mem_zone_bm_rtree, list); - if (&bm->cur.zone->list != &bm->zones) { - bm->cur.node = list_entry(bm->cur.zone->leaves.next, + if (&bm->cur[index].zone->list != &bm->zones) { + bm->cur[index].node = list_entry(bm->cur[index].zone->leaves.next, struct rtree_node, list); - bm->cur.node_pfn = 0; - bm->cur.node_bit = 0; + bm->cur[index].node_pfn = 0; + bm->cur[index].node_bit = 0; return true; } @@ -799,38 +846,29 @@ static bool rtree_next_node(struct memory_bitmap *bm) * It is required to run memory_bm_position_reset() before the * first call to this function. */ -static unsigned long memory_bm_next_pfn(struct memory_bitmap *bm) +unsigned long memory_bm_next_pfn(struct memory_bitmap *bm, int index) { unsigned long bits, pfn, pages; int bit; + index += NR_CPUS; /* Iteration state is separated from get/set/test */ + do { - pages = bm->cur.zone->end_pfn - bm->cur.zone->start_pfn; - bits = min(pages - bm->cur.node_pfn, BM_BITS_PER_BLOCK); - bit = find_next_bit(bm->cur.node->data, bits, - bm->cur.node_bit); + pages = bm->cur[index].zone->end_pfn - bm->cur[index].zone->start_pfn; + bits = min(pages - bm->cur[index].node_pfn, BM_BITS_PER_BLOCK); + bit = find_next_bit(bm->cur[index].node->data, bits, + bm->cur[index].node_bit); if (bit < bits) { - pfn = bm->cur.zone->start_pfn + bm->cur.node_pfn + bit; - bm->cur.node_bit = bit + 1; + pfn = bm->cur[index].zone->start_pfn + bm->cur[index].node_pfn + bit; + bm->cur[index].node_bit = bit + 1; return pfn; } - } while (rtree_next_node(bm)); + } while (rtree_next_node(bm, index)); return BM_END_OF_MAP; } -/** - * This structure represents a range of page frames the contents of which - * should not be saved during the suspend. - */ - -struct nosave_region { - struct list_head list; - unsigned long start_pfn; - unsigned long end_pfn; -}; - -static LIST_HEAD(nosave_regions); +LIST_HEAD(nosave_regions); /** * register_nosave_region - register a range of page frames the contents @@ -889,37 +927,37 @@ static struct memory_bitmap *free_pages_map; void swsusp_set_page_free(struct page *page) { if (free_pages_map) - memory_bm_set_bit(free_pages_map, page_to_pfn(page)); + memory_bm_set_bit(free_pages_map, 0, page_to_pfn(page)); } static int swsusp_page_is_free(struct page *page) { return free_pages_map ? - memory_bm_test_bit(free_pages_map, page_to_pfn(page)) : 0; + memory_bm_test_bit(free_pages_map, 0, page_to_pfn(page)) : 0; } void swsusp_unset_page_free(struct page *page) { if (free_pages_map) - memory_bm_clear_bit(free_pages_map, page_to_pfn(page)); + memory_bm_clear_bit(free_pages_map, 0, page_to_pfn(page)); } static void swsusp_set_page_forbidden(struct page *page) { if (forbidden_pages_map) - memory_bm_set_bit(forbidden_pages_map, page_to_pfn(page)); + memory_bm_set_bit(forbidden_pages_map, 0, page_to_pfn(page)); } int swsusp_page_is_forbidden(struct page *page) { return forbidden_pages_map ? - memory_bm_test_bit(forbidden_pages_map, page_to_pfn(page)) : 0; + memory_bm_test_bit(forbidden_pages_map, 0, page_to_pfn(page)) : 0; } static void swsusp_unset_page_forbidden(struct page *page) { if (forbidden_pages_map) - memory_bm_clear_bit(forbidden_pages_map, page_to_pfn(page)); + memory_bm_clear_bit(forbidden_pages_map, 0, page_to_pfn(page)); } /** @@ -950,7 +988,7 @@ static void mark_nosave_pages(struct memory_bitmap *bm) * touch the PFNs for which the error is * returned anyway. */ - mem_bm_set_bit_check(bm, pfn); + mem_bm_set_bit_check(bm, 0, pfn); } } } @@ -1078,7 +1116,7 @@ static unsigned int count_free_highmem_pages(void) * We should save the page if it isn't Nosave or NosaveFree, or Reserved, * and it isn't a part of a free chunk of pages. */ -static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn) +struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn) { struct page *page; @@ -1125,11 +1163,6 @@ static unsigned int count_highmem_pages(void) } return n; } -#else -static inline void *saveable_highmem_page(struct zone *z, unsigned long p) -{ - return NULL; -} #endif /* CONFIG_HIGHMEM */ /** @@ -1140,7 +1173,7 @@ static inline void *saveable_highmem_page(struct zone *z, unsigned long p) * of pages statically defined as 'unsaveable', and it isn't a part of * a free chunk of pages. */ -static struct page *saveable_page(struct zone *zone, unsigned long pfn) +struct page *saveable_page(struct zone *zone, unsigned long pfn) { struct page *page; @@ -1278,15 +1311,15 @@ copy_data_pages(struct memory_bitmap *copy_bm, struct memory_bitmap *orig_bm) max_zone_pfn = zone_end_pfn(zone); for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) if (page_is_saveable(zone, pfn)) - memory_bm_set_bit(orig_bm, pfn); + memory_bm_set_bit(orig_bm, 0, pfn); } memory_bm_position_reset(orig_bm); memory_bm_position_reset(copy_bm); for(;;) { - pfn = memory_bm_next_pfn(orig_bm); + pfn = memory_bm_next_pfn(orig_bm, 0); if (unlikely(pfn == BM_END_OF_MAP)) break; - copy_data_page(memory_bm_next_pfn(copy_bm), pfn); + copy_data_page(memory_bm_next_pfn(copy_bm, 0), pfn); } } @@ -1332,8 +1365,8 @@ void swsusp_free(void) memory_bm_position_reset(free_pages_map); loop: - fr_pfn = memory_bm_next_pfn(free_pages_map); - fb_pfn = memory_bm_next_pfn(forbidden_pages_map); + fr_pfn = memory_bm_next_pfn(free_pages_map, 0); + fb_pfn = memory_bm_next_pfn(forbidden_pages_map, 0); /* * Find the next bit set in both bitmaps. This is guaranteed to @@ -1341,16 +1374,16 @@ loop: */ do { if (fb_pfn < fr_pfn) - fb_pfn = memory_bm_next_pfn(forbidden_pages_map); + fb_pfn = memory_bm_next_pfn(forbidden_pages_map, 0); if (fr_pfn < fb_pfn) - fr_pfn = memory_bm_next_pfn(free_pages_map); + fr_pfn = memory_bm_next_pfn(free_pages_map, 0); } while (fb_pfn != fr_pfn); if (fr_pfn != BM_END_OF_MAP && pfn_valid(fr_pfn)) { struct page *page = pfn_to_page(fr_pfn); - memory_bm_clear_current(forbidden_pages_map); - memory_bm_clear_current(free_pages_map); + memory_bm_clear_current(forbidden_pages_map, 0); + memory_bm_clear_current(free_pages_map, 0); __free_page(page); goto loop; } @@ -1385,7 +1418,7 @@ static unsigned long preallocate_image_pages(unsigned long nr_pages, gfp_t mask) page = alloc_image_page(mask); if (!page) break; - memory_bm_set_bit(©_bm, page_to_pfn(page)); + memory_bm_set_bit(©_bm, 0, page_to_pfn(page)); if (PageHighMem(page)) alloc_highmem++; else @@ -1481,7 +1514,7 @@ static unsigned long free_unnecessary_pages(void) memory_bm_position_reset(©_bm); while (to_free_normal > 0 || to_free_highmem > 0) { - unsigned long pfn = memory_bm_next_pfn(©_bm); + unsigned long pfn = memory_bm_next_pfn(©_bm, 0); struct page *page = pfn_to_page(pfn); if (PageHighMem(page)) { @@ -1495,7 +1528,7 @@ static unsigned long free_unnecessary_pages(void) to_free_normal--; alloc_normal--; } - memory_bm_clear_bit(©_bm, pfn); + memory_bm_clear_bit(©_bm, 0, pfn); swsusp_unset_page_forbidden(page); swsusp_unset_page_free(page); __free_page(page); @@ -1780,7 +1813,7 @@ alloc_highmem_pages(struct memory_bitmap *bm, unsigned int nr_highmem) struct page *page; page = alloc_image_page(__GFP_HIGHMEM|__GFP_KSWAPD_RECLAIM); - memory_bm_set_bit(bm, page_to_pfn(page)); + memory_bm_set_bit(bm, 0, page_to_pfn(page)); } return nr_highmem; } @@ -1823,7 +1856,7 @@ swsusp_alloc(struct memory_bitmap *orig_bm, struct memory_bitmap *copy_bm, page = alloc_image_page(GFP_ATOMIC | __GFP_COLD); if (!page) goto err_out; - memory_bm_set_bit(copy_bm, page_to_pfn(page)); + memory_bm_set_bit(copy_bm, 0, page_to_pfn(page)); } } @@ -1838,6 +1871,9 @@ asmlinkage __visible int swsusp_save(void) { unsigned int nr_pages, nr_highmem; + if (toi_running) + return toi_post_context_save(); + printk(KERN_INFO "PM: Creating hibernation image:\n"); drain_local_pages(NULL); @@ -1885,7 +1921,7 @@ static int init_header_complete(struct swsusp_info *info) return 0; } -static char *check_image_kernel(struct swsusp_info *info) +char *check_image_kernel(struct swsusp_info *info) { if (info->version_code != LINUX_VERSION_CODE) return "kernel version"; @@ -1906,7 +1942,7 @@ unsigned long snapshot_get_image_size(void) return nr_copy_pages + nr_meta_pages + 1; } -static int init_header(struct swsusp_info *info) +int init_header(struct swsusp_info *info) { memset(info, 0, sizeof(struct swsusp_info)); info->num_physpages = get_num_physpages(); @@ -1928,7 +1964,7 @@ pack_pfns(unsigned long *buf, struct memory_bitmap *bm) int j; for (j = 0; j < PAGE_SIZE / sizeof(long); j++) { - buf[j] = memory_bm_next_pfn(bm); + buf[j] = memory_bm_next_pfn(bm, 0); if (unlikely(buf[j] == BM_END_OF_MAP)) break; /* Save page key for data page (s390 only). */ @@ -1979,7 +2015,7 @@ int snapshot_read_next(struct snapshot_handle *handle) } else { struct page *page; - page = pfn_to_page(memory_bm_next_pfn(©_bm)); + page = pfn_to_page(memory_bm_next_pfn(©_bm, 0)); if (PageHighMem(page)) { /* Highmem pages are copied to the buffer, * because we can't return with a kmapped @@ -2021,7 +2057,7 @@ static int mark_unsafe_pages(struct memory_bitmap *bm) /* Mark pages that correspond to the "original" pfns as "unsafe" */ memory_bm_position_reset(bm); do { - pfn = memory_bm_next_pfn(bm); + pfn = memory_bm_next_pfn(bm, 0); if (likely(pfn != BM_END_OF_MAP)) { if (likely(pfn_valid(pfn))) swsusp_set_page_free(pfn_to_page(pfn)); @@ -2041,10 +2077,10 @@ duplicate_memory_bitmap(struct memory_bitmap *dst, struct memory_bitmap *src) unsigned long pfn; memory_bm_position_reset(src); - pfn = memory_bm_next_pfn(src); + pfn = memory_bm_next_pfn(src, 0); while (pfn != BM_END_OF_MAP) { - memory_bm_set_bit(dst, pfn); - pfn = memory_bm_next_pfn(src); + memory_bm_set_bit(dst, 0, pfn); + pfn = memory_bm_next_pfn(src, 0); } } @@ -2095,8 +2131,8 @@ static int unpack_orig_pfns(unsigned long *buf, struct memory_bitmap *bm) /* Extract and buffer page key for data page (s390 only). */ page_key_memorize(buf + j); - if (memory_bm_pfn_present(bm, buf[j])) - memory_bm_set_bit(bm, buf[j]); + if (memory_bm_pfn_present(bm, 0, buf[j])) + memory_bm_set_bit(bm, 0, buf[j]); else return -EFAULT; } @@ -2139,12 +2175,12 @@ static unsigned int count_highmem_image_pages(struct memory_bitmap *bm) unsigned int cnt = 0; memory_bm_position_reset(bm); - pfn = memory_bm_next_pfn(bm); + pfn = memory_bm_next_pfn(bm, 0); while (pfn != BM_END_OF_MAP) { if (PageHighMem(pfn_to_page(pfn))) cnt++; - pfn = memory_bm_next_pfn(bm); + pfn = memory_bm_next_pfn(bm, 0); } return cnt; } @@ -2189,7 +2225,7 @@ prepare_highmem_image(struct memory_bitmap *bm, unsigned int *nr_highmem_p) page = alloc_page(__GFP_HIGHMEM); if (!swsusp_page_is_free(page)) { /* The page is "safe", set its bit the bitmap */ - memory_bm_set_bit(bm, page_to_pfn(page)); + memory_bm_set_bit(bm, 0, page_to_pfn(page)); safe_highmem_pages++; } /* Mark the page as allocated */ @@ -2247,7 +2283,7 @@ get_highmem_page_buffer(struct page *page, struct chain_allocator *ca) /* Copy of the page will be stored in high memory */ kaddr = buffer; - tmp = pfn_to_page(memory_bm_next_pfn(safe_highmem_bm)); + tmp = pfn_to_page(memory_bm_next_pfn(safe_highmem_bm, 0)); safe_highmem_pages--; last_highmem_page = tmp; pbe->copy_page = tmp; @@ -2418,7 +2454,7 @@ static void *get_buffer(struct memory_bitmap *bm, struct chain_allocator *ca) { struct pbe *pbe; struct page *page; - unsigned long pfn = memory_bm_next_pfn(bm); + unsigned long pfn = memory_bm_next_pfn(bm, 0); if (pfn == BM_END_OF_MAP) return ERR_PTR(-EFAULT); @@ -2605,3 +2641,82 @@ int restore_highmem(void) return 0; } #endif /* CONFIG_HIGHMEM */ + +struct memory_bitmap *pageset1_map, *pageset2_map, *free_map, *nosave_map, + *pageset1_copy_map, *io_map, *page_resave_map, *compare_map; + +int resume_attempted; + +int memory_bm_write(struct memory_bitmap *bm, int (*rw_chunk) + (int rw, struct toi_module_ops *owner, char *buffer, int buffer_size)) +{ + int result; + + memory_bm_position_reset(bm); + + do { + result = rw_chunk(WRITE, NULL, (char *) bm->cur[0].node->data, PAGE_SIZE); + + if (result) + return result; + } while (rtree_next_node(bm, 0)); + return 0; +} + +int memory_bm_read(struct memory_bitmap *bm, int (*rw_chunk) + (int rw, struct toi_module_ops *owner, char *buffer, int buffer_size)) +{ + int result; + + memory_bm_position_reset(bm); + + do { + result = rw_chunk(READ, NULL, (char *) bm->cur[0].node->data, PAGE_SIZE); + + if (result) + return result; + + } while (rtree_next_node(bm, 0)); + return 0; +} + +int memory_bm_space_needed(struct memory_bitmap *bm) +{ + unsigned long bytes = 0; + + memory_bm_position_reset(bm); + do { + bytes += PAGE_SIZE; + } while (rtree_next_node(bm, 0)); + return bytes; +} + +int toi_alloc_bitmap(struct memory_bitmap **bm) +{ + int error; + struct memory_bitmap *bm1; + + bm1 = kzalloc(sizeof(struct memory_bitmap), GFP_KERNEL); + if (!bm1) + return -ENOMEM; + + error = memory_bm_create(bm1, GFP_KERNEL, PG_ANY); + if (error) { + printk("Error returned - %d.\n", error); + kfree(bm1); + return -ENOMEM; + } + + *bm = bm1; + return 0; +} + +void toi_free_bitmap(struct memory_bitmap **bm) +{ + if (!*bm) + return; + + memory_bm_free(*bm, 0); + kfree(*bm); + *bm = NULL; +} diff --git b/kernel/power/tuxonice.h b/kernel/power/tuxonice.h new file mode 100644 index 0000000..10b6563 --- /dev/null +++ b/kernel/power/tuxonice.h @@ -0,0 +1,260 @@ +/* + * kernel/power/tuxonice.h + * + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * It contains declarations used throughout swsusp. + * + */ + +#ifndef KERNEL_POWER_TOI_H +#define KERNEL_POWER_TOI_H + +#include +#include +#include +#include +#include +#include "tuxonice_pageflags.h" +#include "power.h" + +#define TOI_CORE_VERSION "3.3" +#define TOI_HEADER_VERSION 3 +#define MY_BOOT_KERNEL_DATA_VERSION 4 + +struct toi_boot_kernel_data { + int version; + int size; + unsigned long toi_action; + unsigned long toi_debug_state; + u32 toi_default_console_level; + int toi_io_time[2][2]; + char toi_nosave_commandline[COMMAND_LINE_SIZE]; + unsigned long pages_used[33]; + unsigned long incremental_bytes_in; + unsigned long incremental_bytes_out; + unsigned long compress_bytes_in; + unsigned long compress_bytes_out; + unsigned long pruned_pages; +}; + +extern struct toi_boot_kernel_data toi_bkd; + +/* Location of book kernel data struct in kernel being resumed */ +extern unsigned long boot_kernel_data_buffer; + +/* == Action states == */ + +enum { + TOI_REBOOT, + TOI_PAUSE, + TOI_LOGALL, + TOI_CAN_CANCEL, + TOI_KEEP_IMAGE, + TOI_FREEZER_TEST, + TOI_SINGLESTEP, + TOI_PAUSE_NEAR_PAGESET_END, + TOI_TEST_FILTER_SPEED, + TOI_TEST_BIO, + TOI_NO_PAGESET2, + TOI_IGNORE_ROOTFS, + TOI_REPLACE_SWSUSP, + TOI_PAGESET2_FULL, + TOI_ABORT_ON_RESAVE_NEEDED, + TOI_NO_MULTITHREADED_IO, + TOI_NO_DIRECT_LOAD, /* Obsolete */ + TOI_LATE_CPU_HOTPLUG, /* Obsolete */ + TOI_GET_MAX_MEM_ALLOCD, + TOI_NO_FLUSHER_THREAD, + TOI_NO_PS2_IF_UNNEEDED, + TOI_POST_RESUME_BREAKPOINT, + TOI_NO_READAHEAD, + TOI_TRACE_DEBUG_ON, + TOI_INCREMENTAL_IMAGE, +}; + +extern unsigned long toi_bootflags_mask; + +#define clear_action_state(bit) (test_and_clear_bit(bit, &toi_bkd.toi_action)) + +/* == Result states == */ + +enum { + TOI_ABORTED, + TOI_ABORT_REQUESTED, + TOI_NOSTORAGE_AVAILABLE, + TOI_INSUFFICIENT_STORAGE, + TOI_FREEZING_FAILED, + TOI_KEPT_IMAGE, + TOI_WOULD_EAT_MEMORY, + TOI_UNABLE_TO_FREE_ENOUGH_MEMORY, + TOI_PM_SEM, + TOI_DEVICE_REFUSED, + TOI_SYSDEV_REFUSED, + TOI_EXTRA_PAGES_ALLOW_TOO_SMALL, + TOI_UNABLE_TO_PREPARE_IMAGE, + TOI_FAILED_MODULE_INIT, + TOI_FAILED_MODULE_CLEANUP, + TOI_FAILED_IO, + TOI_OUT_OF_MEMORY, + TOI_IMAGE_ERROR, + TOI_PLATFORM_PREP_FAILED, + TOI_CPU_HOTPLUG_FAILED, + TOI_ARCH_PREPARE_FAILED, /* Removed Linux-3.0 */ + TOI_RESAVE_NEEDED, + TOI_CANT_SUSPEND, + TOI_NOTIFIERS_PREPARE_FAILED, + TOI_PRE_SNAPSHOT_FAILED, + TOI_PRE_RESTORE_FAILED, + TOI_USERMODE_HELPERS_ERR, + TOI_CANT_USE_ALT_RESUME, + TOI_HEADER_TOO_BIG, + TOI_WAKEUP_EVENT, + TOI_SYSCORE_REFUSED, + TOI_DPM_PREPARE_FAILED, + TOI_DPM_SUSPEND_FAILED, + TOI_NUM_RESULT_STATES /* Used in printing debug info only */ +}; + +extern unsigned long toi_result; + +#define set_result_state(bit) (test_and_set_bit(bit, &toi_result)) +#define set_abort_result(bit) (test_and_set_bit(TOI_ABORTED, &toi_result), \ + test_and_set_bit(bit, &toi_result)) +#define clear_result_state(bit) (test_and_clear_bit(bit, &toi_result)) +#define test_result_state(bit) (test_bit(bit, &toi_result)) + +/* == Debug sections and levels == */ + +/* debugging levels. */ +enum { + TOI_STATUS = 0, + TOI_ERROR = 2, + TOI_LOW, + TOI_MEDIUM, + TOI_HIGH, + TOI_VERBOSE, +}; + +enum { + TOI_ANY_SECTION, + TOI_EAT_MEMORY, + TOI_IO, + TOI_HEADER, + TOI_WRITER, + TOI_MEMORY, + TOI_PAGEDIR, + TOI_COMPRESS, + TOI_BIO, +}; + +#define set_debug_state(bit) (test_and_set_bit(bit, &toi_bkd.toi_debug_state)) +#define clear_debug_state(bit) \ + (test_and_clear_bit(bit, &toi_bkd.toi_debug_state)) +#define test_debug_state(bit) (test_bit(bit, &toi_bkd.toi_debug_state)) + +/* == Steps in hibernating == */ + +enum { + STEP_HIBERNATE_PREPARE_IMAGE, + STEP_HIBERNATE_SAVE_IMAGE, + STEP_HIBERNATE_POWERDOWN, + STEP_RESUME_CAN_RESUME, + STEP_RESUME_LOAD_PS1, + STEP_RESUME_DO_RESTORE, + STEP_RESUME_READ_PS2, + STEP_RESUME_GO, + STEP_RESUME_ALT_IMAGE, + STEP_CLEANUP, + STEP_QUIET_CLEANUP +}; + +/* == TuxOnIce states == + (see also include/linux/suspend.h) */ + +#define get_toi_state() (toi_state) +#define restore_toi_state(saved_state) \ + do { toi_state = saved_state; } while (0) + +/* == Module support == */ + +struct toi_core_fns { + int (*post_context_save)(void); + unsigned long (*get_nonconflicting_page)(void); + int (*try_hibernate)(void); + void (*try_resume)(void); +}; + +extern struct toi_core_fns *toi_core_fns; + +/* == All else == */ +#define KB(x) ((x) << (PAGE_SHIFT - 10)) +#define MB(x) ((x) >> (20 - PAGE_SHIFT)) + +extern int toi_start_anything(int toi_or_resume); +extern void toi_finish_anything(int toi_or_resume); + +extern int save_image_part1(void); +extern int toi_atomic_restore(void); + +extern int toi_try_hibernate(void); +extern void toi_try_resume(void); + +extern int __toi_post_context_save(void); + +extern unsigned int nr_hibernates; +extern char alt_resume_param[256]; + +extern void copyback_post(void); +extern int toi_hibernate(void); +extern unsigned long extra_pd1_pages_used; + +#define SECTOR_SIZE 512 + +extern void toi_early_boot_message(int can_erase_image, int default_answer, + char *warning_reason, ...); + +extern int do_check_can_resume(void); +extern int do_toi_step(int step); +extern int toi_launch_userspace_program(char *command, int channel_no, + int wait, int debug); + +extern char tuxonice_signature[9]; + +extern int toi_start_other_threads(void); +extern void toi_stop_other_threads(void); + +extern int toi_trace_index; +#define TOI_TRACE_DEBUG(PFN, DESC, ...) \ + do { \ + if (test_action_state(TOI_TRACE_DEBUG_ON)) { \ + printk("*TOI* %ld %02d" DESC "\n", PFN, toi_trace_index, ##__VA_ARGS__); \ + } \ + } while(0) + +#ifdef CONFIG_TOI_KEEP_IMAGE +#define toi_keeping_image (test_action_state(TOI_KEEP_IMAGE) || test_action_state(TOI_INCREMENTAL_IMAGE)) +#else +#define toi_keeping_image (0) +#endif + +#ifdef CONFIG_TOI_INCREMENTAL +extern void toi_reset_dirtiness_one(unsigned long pfn, int verbose); +extern int toi_reset_dirtiness(int verbose); +extern void toi_cbw_write(void); +extern void toi_cbw_restore(void); +extern int toi_allocate_cbw_data(void); +extern void toi_free_cbw_data(void); +extern int toi_cbw_init(void); +extern void toi_mark_tasks_cbw(void); +#else +static inline int toi_reset_dirtiness(int verbose) { return 0; } +#define toi_cbw_write() do { } while(0) +#define toi_cbw_restore() do { } while(0) +#define toi_allocate_cbw_data() do { } while(0) +#define toi_free_cbw_data() do { } while(0) +static inline int toi_cbw_init(void) { return 0; } +#endif +#endif diff --git b/kernel/power/tuxonice_alloc.c b/kernel/power/tuxonice_alloc.c new file mode 100644 index 0000000..1d8b1cb --- /dev/null +++ b/kernel/power/tuxonice_alloc.c @@ -0,0 +1,308 @@ +/* + * kernel/power/tuxonice_alloc.c + * + * Copyright (C) 2008-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + */ + +#include +#include +#include "tuxonice_modules.h" +#include "tuxonice_alloc.h" +#include "tuxonice_sysfs.h" +#include "tuxonice.h" + +#define TOI_ALLOC_PATHS 41 + +static DEFINE_MUTEX(toi_alloc_mutex); + +static struct toi_module_ops toi_alloc_ops; + +static int toi_fail_num; + +static atomic_t toi_alloc_count[TOI_ALLOC_PATHS], + toi_free_count[TOI_ALLOC_PATHS], + toi_test_count[TOI_ALLOC_PATHS], + toi_fail_count[TOI_ALLOC_PATHS]; +static int toi_cur_allocd[TOI_ALLOC_PATHS], toi_max_allocd[TOI_ALLOC_PATHS]; +static int cur_allocd, max_allocd; + +static char *toi_alloc_desc[TOI_ALLOC_PATHS] = { + "", /* 0 */ + "get_io_info_struct", + "extent", + "extent (loading chain)", + "userui channel", + "userui arg", /* 5 */ + "attention list metadata", + "extra pagedir memory metadata", + "bdev metadata", + "extra pagedir memory", + "header_locations_read", /* 10 */ + "bio queue", + "prepare_readahead", + "i/o buffer", + "writer buffer in bio_init", + "checksum buffer", /* 15 */ + "compression buffer", + "filewriter signature op", + "set resume param alloc1", + "set resume param alloc2", + "debugging info buffer", /* 20 */ + "check can resume buffer", + "write module config buffer", + "read module config buffer", + "write image header buffer", + "read pageset1 buffer", /* 25 */ + "get_have_image_data buffer", + "checksum page", + "worker rw loop", + "get nonconflicting page", + "ps1 load addresses", /* 30 */ + "remove swap image", + "swap image exists", + "swap parse sig location", + "sysfs kobj", + "swap mark resume attempted buffer", /* 35 */ + "cluster member", + "boot kernel data buffer", + "setting swap signature", + "block i/o bdev struct", + "copy before write", /* 40 */ +}; + +#define MIGHT_FAIL(FAIL_NUM, FAIL_VAL) \ + do { \ + BUG_ON(FAIL_NUM >= TOI_ALLOC_PATHS); \ + \ + if (FAIL_NUM == toi_fail_num) { \ + atomic_inc(&toi_test_count[FAIL_NUM]); \ + toi_fail_num = 0; \ + return FAIL_VAL; \ + } \ + } while (0) + +static void alloc_update_stats(int fail_num, void *result, int size) +{ + if (!result) { + atomic_inc(&toi_fail_count[fail_num]); + return; + } + + atomic_inc(&toi_alloc_count[fail_num]); + if (unlikely(test_action_state(TOI_GET_MAX_MEM_ALLOCD))) { + mutex_lock(&toi_alloc_mutex); + toi_cur_allocd[fail_num]++; + cur_allocd += size; + if (unlikely(cur_allocd > max_allocd)) { + int i; + + for (i = 0; i < TOI_ALLOC_PATHS; i++) + toi_max_allocd[i] = toi_cur_allocd[i]; + max_allocd = cur_allocd; + } + mutex_unlock(&toi_alloc_mutex); + } +} + +static void free_update_stats(int fail_num, int size) +{ + BUG_ON(fail_num >= TOI_ALLOC_PATHS); + atomic_inc(&toi_free_count[fail_num]); + if (unlikely(atomic_read(&toi_free_count[fail_num]) > + atomic_read(&toi_alloc_count[fail_num]))) + dump_stack(); + if (unlikely(test_action_state(TOI_GET_MAX_MEM_ALLOCD))) { + mutex_lock(&toi_alloc_mutex); + cur_allocd -= size; + toi_cur_allocd[fail_num]--; + mutex_unlock(&toi_alloc_mutex); + } +} + +void *toi_kzalloc(int fail_num, size_t size, gfp_t flags) +{ + void *result; + + if (toi_alloc_ops.enabled) + MIGHT_FAIL(fail_num, NULL); + result = kzalloc(size, flags); + if (toi_alloc_ops.enabled) + alloc_update_stats(fail_num, result, size); + if (fail_num == toi_trace_allocs) + dump_stack(); + return result; +} + +unsigned long toi_get_free_pages(int fail_num, gfp_t mask, + unsigned int order) +{ + unsigned long result; + + mask |= ___GFP_TOI_NOTRACK; + if (toi_alloc_ops.enabled) + MIGHT_FAIL(fail_num, 0); + result = __get_free_pages(mask, order); + if (toi_alloc_ops.enabled) + alloc_update_stats(fail_num, (void *) result, + PAGE_SIZE << order); + if (fail_num == toi_trace_allocs) + dump_stack(); + return result; +} + +struct page *toi_alloc_page(int fail_num, gfp_t mask) +{ + struct page *result; + + if (toi_alloc_ops.enabled) + MIGHT_FAIL(fail_num, NULL); + mask |= ___GFP_TOI_NOTRACK; + result = alloc_page(mask); + if (toi_alloc_ops.enabled) + alloc_update_stats(fail_num, (void *) result, PAGE_SIZE); + if (fail_num == toi_trace_allocs) + dump_stack(); + return result; +} + +unsigned long toi_get_zeroed_page(int fail_num, gfp_t mask) +{ + unsigned long result; + + if (toi_alloc_ops.enabled) + MIGHT_FAIL(fail_num, 0); + mask |= ___GFP_TOI_NOTRACK; + result = get_zeroed_page(mask); + if (toi_alloc_ops.enabled) + alloc_update_stats(fail_num, (void *) result, PAGE_SIZE); + if (fail_num == toi_trace_allocs) + dump_stack(); + return result; +} + +void toi_kfree(int fail_num, const void *arg, int size) +{ + if (arg && toi_alloc_ops.enabled) + free_update_stats(fail_num, size); + + if (fail_num == toi_trace_allocs) + dump_stack(); + kfree(arg); +} + +void toi_free_page(int fail_num, unsigned long virt) +{ + if (virt && toi_alloc_ops.enabled) + free_update_stats(fail_num, PAGE_SIZE); + + if (fail_num == toi_trace_allocs) + dump_stack(); + free_page(virt); +} + +void toi__free_page(int fail_num, struct page *page) +{ + if (page && toi_alloc_ops.enabled) + free_update_stats(fail_num, PAGE_SIZE); + + if (fail_num == toi_trace_allocs) + dump_stack(); + __free_page(page); +} + +void toi_free_pages(int fail_num, struct page *page, int order) +{ + if (page && toi_alloc_ops.enabled) + free_update_stats(fail_num, PAGE_SIZE << order); + + if (fail_num == toi_trace_allocs) + dump_stack(); + __free_pages(page, order); +} + +void toi_alloc_print_debug_stats(void) +{ + int i, header_done = 0; + + if (!toi_alloc_ops.enabled) + return; + + for (i = 0; i < TOI_ALLOC_PATHS; i++) + if (atomic_read(&toi_alloc_count[i]) != + atomic_read(&toi_free_count[i])) { + if (!header_done) { + printk(KERN_INFO "Idx Allocs Frees Tests " + " Fails Max Description\n"); + header_done = 1; + } + + printk(KERN_INFO "%3d %7d %7d %7d %7d %7d %s\n", i, + atomic_read(&toi_alloc_count[i]), + atomic_read(&toi_free_count[i]), + atomic_read(&toi_test_count[i]), + atomic_read(&toi_fail_count[i]), + toi_max_allocd[i], + toi_alloc_desc[i]); + } +} + +static int toi_alloc_initialise(int starting_cycle) +{ + int i; + + if (!starting_cycle) + return 0; + + if (toi_trace_allocs) + dump_stack(); + + for (i = 0; i < TOI_ALLOC_PATHS; i++) { + atomic_set(&toi_alloc_count[i], 0); + atomic_set(&toi_free_count[i], 0); + atomic_set(&toi_test_count[i], 0); + atomic_set(&toi_fail_count[i], 0); + toi_cur_allocd[i] = 0; + toi_max_allocd[i] = 0; + }; + + max_allocd = 0; + cur_allocd = 0; + return 0; +} + +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_INT("failure_test", SYSFS_RW, &toi_fail_num, 0, 99, 0, NULL), + SYSFS_INT("trace", SYSFS_RW, &toi_trace_allocs, 0, TOI_ALLOC_PATHS, 0, + NULL), + SYSFS_BIT("find_max_mem_allocated", SYSFS_RW, &toi_bkd.toi_action, + TOI_GET_MAX_MEM_ALLOCD, 0), + SYSFS_INT("enabled", SYSFS_RW, &toi_alloc_ops.enabled, 0, 1, 0, + NULL) +}; + +static struct toi_module_ops toi_alloc_ops = { + .type = MISC_HIDDEN_MODULE, + .name = "allocation debugging", + .directory = "alloc", + .module = THIS_MODULE, + .early = 1, + .initialise = toi_alloc_initialise, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +int toi_alloc_init(void) +{ + int result = toi_register_module(&toi_alloc_ops); + return result; +} + +void toi_alloc_exit(void) +{ + toi_unregister_module(&toi_alloc_ops); +} diff --git b/kernel/power/tuxonice_alloc.h b/kernel/power/tuxonice_alloc.h new file mode 100644 index 0000000..0cd6b68 --- /dev/null +++ b/kernel/power/tuxonice_alloc.h @@ -0,0 +1,54 @@ +/* + * kernel/power/tuxonice_alloc.h + * + * Copyright (C) 2008-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + */ + +#include +#define TOI_WAIT_GFP (GFP_NOFS | __GFP_NOWARN) +#define TOI_ATOMIC_GFP (GFP_ATOMIC | __GFP_NOWARN) + +#ifdef CONFIG_PM_DEBUG +extern void *toi_kzalloc(int fail_num, size_t size, gfp_t flags); +extern void toi_kfree(int fail_num, const void *arg, int size); + +extern unsigned long toi_get_free_pages(int fail_num, gfp_t mask, + unsigned int order); +#define toi_get_free_page(FAIL_NUM, MASK) toi_get_free_pages(FAIL_NUM, MASK, 0) +extern unsigned long toi_get_zeroed_page(int fail_num, gfp_t mask); +extern void toi_free_page(int fail_num, unsigned long buf); +extern void toi__free_page(int fail_num, struct page *page); +extern void toi_free_pages(int fail_num, struct page *page, int order); +extern struct page *toi_alloc_page(int fail_num, gfp_t mask); +extern int toi_alloc_init(void); +extern void toi_alloc_exit(void); + +extern void toi_alloc_print_debug_stats(void); + +#else /* CONFIG_PM_DEBUG */ + +#define toi_kzalloc(FAIL, SIZE, FLAGS) (kzalloc(SIZE, FLAGS)) +#define toi_kfree(FAIL, ALLOCN, SIZE) (kfree(ALLOCN)) + +#define toi_get_free_pages(FAIL, FLAGS, ORDER) __get_free_pages(FLAGS, ORDER) +#define toi_get_free_page(FAIL, FLAGS) __get_free_page(FLAGS) +#define toi_get_zeroed_page(FAIL, FLAGS) get_zeroed_page(FLAGS) +#define toi_free_page(FAIL, ALLOCN) do { free_page(ALLOCN); } while (0) +#define toi__free_page(FAIL, PAGE) __free_page(PAGE) +#define toi_free_pages(FAIL, PAGE, ORDER) __free_pages(PAGE, ORDER) +#define toi_alloc_page(FAIL, MASK) alloc_page(MASK) +static inline int toi_alloc_init(void) +{ + return 0; +} + +static inline void toi_alloc_exit(void) { } + +static inline void toi_alloc_print_debug_stats(void) { } + +#endif + +extern int toi_trace_allocs; diff --git b/kernel/power/tuxonice_atomic_copy.c b/kernel/power/tuxonice_atomic_copy.c new file mode 100644 index 0000000..5845217 --- /dev/null +++ b/kernel/power/tuxonice_atomic_copy.c @@ -0,0 +1,469 @@ +/* + * kernel/power/tuxonice_atomic_copy.c + * + * Copyright 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * Distributed under GPLv2. + * + * Routines for doing the atomic save/restore. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include "tuxonice.h" +#include "tuxonice_storage.h" +#include "tuxonice_power_off.h" +#include "tuxonice_ui.h" +#include "tuxonice_io.h" +#include "tuxonice_prepare_image.h" +#include "tuxonice_pageflags.h" +#include "tuxonice_checksum.h" +#include "tuxonice_builtin.h" +#include "tuxonice_atomic_copy.h" +#include "tuxonice_alloc.h" +#include "tuxonice_modules.h" + +unsigned long extra_pd1_pages_used; + +/** + * free_pbe_list - free page backup entries used by the atomic copy code. + * @list: List to free. + * @highmem: Whether the list is in highmem. + * + * Normally, this function isn't used. If, however, we need to abort before + * doing the atomic copy, we use this to free the pbes previously allocated. + **/ +static void free_pbe_list(struct pbe **list, int highmem) +{ + while (*list) { + int i; + struct pbe *free_pbe, *next_page = NULL; + struct page *page; + + if (highmem) { + page = (struct page *) *list; + free_pbe = (struct pbe *) kmap(page); + } else { + page = virt_to_page(*list); + free_pbe = *list; + } + + for (i = 0; i < PBES_PER_PAGE; i++) { + if (!free_pbe) + break; + if (highmem) + toi__free_page(29, free_pbe->address); + else + toi_free_page(29, + (unsigned long) free_pbe->address); + free_pbe = free_pbe->next; + } + + if (highmem) { + if (free_pbe) + next_page = free_pbe; + kunmap(page); + } else { + if (free_pbe) + next_page = free_pbe; + } + + toi__free_page(29, page); + *list = (struct pbe *) next_page; + }; +} + +/** + * copyback_post - post atomic-restore actions + * + * After doing the atomic restore, we have a few more things to do: + * 1) We want to retain some values across the restore, so we now copy + * these from the nosave variables to the normal ones. + * 2) Set the status flags. + * 3) Resume devices. + * 4) Tell userui so it can redraw & restore settings. + * 5) Reread the page cache. + **/ +void copyback_post(void) +{ + struct toi_boot_kernel_data *bkd = + (struct toi_boot_kernel_data *) boot_kernel_data_buffer; + + if (toi_activate_storage(1)) + panic("Failed to reactivate our storage."); + + toi_post_atomic_restore_modules(bkd); + + toi_cond_pause(1, "About to reload secondary pagedir."); + + if (read_pageset2(0)) + panic("Unable to successfully reread the page cache."); + + /* + * If the user wants to sleep again after resuming from full-off, + * it's most likely to be in order to suspend to ram, so we'll + * do this check after loading pageset2, to give them the fastest + * wakeup when they are ready to use the computer again. + */ + toi_check_resleep(); + + if (test_action_state(TOI_INCREMENTAL_IMAGE)) + toi_reset_dirtiness(1); +} + +/** + * toi_copy_pageset1 - do the atomic copy of pageset1 + * + * Make the atomic copy of pageset1. We can't use copy_page (as we once did) + * because we can't be sure what side effects it has. On my old Duron, with + * 3DNOW, kernel_fpu_begin increments preempt count, making our preempt + * count at resume time 4 instead of 3. + * + * We don't want to call kmap_atomic unconditionally because it has the side + * effect of incrementing the preempt count, which will leave it one too high + * post resume (the page containing the preempt count will be copied after + * its incremented. This is essentially the same problem. + **/ +void toi_copy_pageset1(void) +{ + int i; + unsigned long source_index, dest_index; + + memory_bm_position_reset(pageset1_map); + memory_bm_position_reset(pageset1_copy_map); + + source_index = memory_bm_next_pfn(pageset1_map, 0); + dest_index = memory_bm_next_pfn(pageset1_copy_map, 0); + + for (i = 0; i < pagedir1.size; i++) { + unsigned long *origvirt, *copyvirt; + struct page *origpage, *copypage; + int loop = (PAGE_SIZE / sizeof(unsigned long)) - 1, + was_present1, was_present2; + + origpage = pfn_to_page(source_index); + copypage = pfn_to_page(dest_index); + + origvirt = PageHighMem(origpage) ? + kmap_atomic(origpage) : + page_address(origpage); + + copyvirt = PageHighMem(copypage) ? + kmap_atomic(copypage) : + page_address(copypage); + + was_present1 = kernel_page_present(origpage); + if (!was_present1) + kernel_map_pages(origpage, 1, 1); + + was_present2 = kernel_page_present(copypage); + if (!was_present2) + kernel_map_pages(copypage, 1, 1); + + while (loop >= 0) { + *(copyvirt + loop) = *(origvirt + loop); + loop--; + } + + if (!was_present1) + kernel_map_pages(origpage, 1, 0); + + if (!was_present2) + kernel_map_pages(copypage, 1, 0); + + if (PageHighMem(origpage)) + kunmap_atomic(origvirt); + + if (PageHighMem(copypage)) + kunmap_atomic(copyvirt); + + source_index = memory_bm_next_pfn(pageset1_map, 0); + dest_index = memory_bm_next_pfn(pageset1_copy_map, 0); + } +} + +/** + * __toi_post_context_save - steps after saving the cpu context + * + * Steps taken after saving the CPU state to make the actual + * atomic copy. + * + * Called from swsusp_save in snapshot.c via toi_post_context_save. + **/ +int __toi_post_context_save(void) +{ + unsigned long old_ps1_size = pagedir1.size; + + check_checksums(); + + free_checksum_pages(); + + toi_recalculate_image_contents(1); + + extra_pd1_pages_used = pagedir1.size > old_ps1_size ? + pagedir1.size - old_ps1_size : 0; + + if (extra_pd1_pages_used > extra_pd1_pages_allowance) { + printk(KERN_INFO "Pageset1 has grown by %lu pages. " + "extra_pages_allowance is currently only %lu.\n", + pagedir1.size - old_ps1_size, + extra_pd1_pages_allowance); + + /* + * Highlevel code will see this, clear the state and + * retry if we haven't already done so twice. + */ + if (any_to_free(1)) { + set_abort_result(TOI_EXTRA_PAGES_ALLOW_TOO_SMALL); + return 1; + } + if (try_allocate_extra_memory()) { + printk(KERN_INFO "Failed to allocate the extra memory" + " needed. Restarting the process."); + set_abort_result(TOI_EXTRA_PAGES_ALLOW_TOO_SMALL); + return 1; + } + printk(KERN_INFO "However it looks like there's enough" + " free ram and storage to handle this, so " + " continuing anyway."); + /* + * What if try_allocate_extra_memory above calls + * toi_allocate_extra_pagedir_memory and it allocs a new + * slab page via toi_kzalloc which should be in ps1? So... + */ + toi_recalculate_image_contents(1); + } + + if (!test_action_state(TOI_TEST_FILTER_SPEED) && + !test_action_state(TOI_TEST_BIO)) + toi_copy_pageset1(); + + return 0; +} + +/** + * toi_hibernate - high level code for doing the atomic copy + * + * High-level code which prepares to do the atomic copy. Loosely based + * on the swsusp version, but with the following twists: + * - We set toi_running so the swsusp code uses our code paths. + * - We give better feedback regarding what goes wrong if there is a + * problem. + * - We use an extra function to call the assembly, just in case this code + * is in a module (return address). + **/ +int toi_hibernate(void) +{ + int error; + + error = toi_lowlevel_builtin(); + + if (!error) { + struct toi_boot_kernel_data *bkd = + (struct toi_boot_kernel_data *) boot_kernel_data_buffer; + + /* + * The boot kernel's data may be larger (newer version) or + * smaller (older version) than ours. Copy the minimum + * of the two sizes, so that we don't overwrite valid values + * from pre-atomic copy. + */ + + memcpy(&toi_bkd, (char *) boot_kernel_data_buffer, + min_t(int, sizeof(struct toi_boot_kernel_data), + bkd->size)); + } + + return error; +} + +/** + * toi_atomic_restore - prepare to do the atomic restore + * + * Get ready to do the atomic restore. This part gets us into the same + * state we are in prior to do calling do_toi_lowlevel while + * hibernating: hot-unplugging secondary cpus and freeze processes, + * before starting the thread that will do the restore. + **/ +int toi_atomic_restore(void) +{ + int error; + + toi_prepare_status(DONT_CLEAR_BAR, "Atomic restore."); + + memcpy(&toi_bkd.toi_nosave_commandline, saved_command_line, + strlen(saved_command_line)); + + toi_pre_atomic_restore_modules(&toi_bkd); + + if (add_boot_kernel_data_pbe()) + goto Failed; + + toi_prepare_status(DONT_CLEAR_BAR, "Doing atomic copy/restore."); + + if (toi_go_atomic(PMSG_QUIESCE, 0)) + goto Failed; + + /* We'll ignore saved state, but this gets preempt count (etc) right */ + save_processor_state(); + + error = swsusp_arch_resume(); + /* + * Code below is only ever reached in case of failure. Otherwise + * execution continues at place where swsusp_arch_suspend was called. + * + * We don't know whether it's safe to continue (this shouldn't happen), + * so lets err on the side of caution. + */ + BUG(); + +Failed: + free_pbe_list(&restore_pblist, 0); +#ifdef CONFIG_HIGHMEM + free_pbe_list(&restore_highmem_pblist, 1); +#endif + return 1; +} + +/** + * toi_go_atomic - do the actual atomic copy/restore + * @state: The state to use for dpm_suspend_start & power_down calls. + * @suspend_time: Whether we're suspending or resuming. + **/ +int toi_go_atomic(pm_message_t state, int suspend_time) +{ + if (suspend_time) { + if (platform_begin(1)) { + set_abort_result(TOI_PLATFORM_PREP_FAILED); + toi_end_atomic(ATOMIC_STEP_PLATFORM_END, suspend_time, 3); + return 1; + } + + if (dpm_prepare(PMSG_FREEZE)) { + set_abort_result(TOI_DPM_PREPARE_FAILED); + dpm_complete(PMSG_RECOVER); + toi_end_atomic(ATOMIC_STEP_PLATFORM_END, suspend_time, 3); + return 1; + } + } + + suspend_console(); + pm_restrict_gfp_mask(); + + if (suspend_time) { + if (dpm_suspend(state)) { + set_abort_result(TOI_DPM_SUSPEND_FAILED); + toi_end_atomic(ATOMIC_STEP_DEVICE_RESUME, suspend_time, 3); + return 1; + } + } else { + if (dpm_suspend_start(state)) { + set_abort_result(TOI_DPM_SUSPEND_FAILED); + toi_end_atomic(ATOMIC_STEP_DEVICE_RESUME, suspend_time, 3); + return 1; + } + } + + /* At this point, dpm_suspend_start() has been called, but *not* + * dpm_suspend_noirq(). We *must* dpm_suspend_noirq() now. + * Otherwise, drivers for some devices (e.g. interrupt controllers) + * become desynchronized with the actual state of the hardware + * at resume time, and evil weirdness ensues. + */ + + if (dpm_suspend_end(state)) { + set_abort_result(TOI_DEVICE_REFUSED); + toi_end_atomic(ATOMIC_STEP_DEVICE_RESUME, suspend_time, 1); + return 1; + } + + if (suspend_time) { + if (platform_pre_snapshot(1)) + set_abort_result(TOI_PRE_SNAPSHOT_FAILED); + } else { + if (platform_pre_restore(1)) + set_abort_result(TOI_PRE_RESTORE_FAILED); + } + + if (test_result_state(TOI_ABORTED)) { + toi_end_atomic(ATOMIC_STEP_PLATFORM_FINISH, suspend_time, 1); + return 1; + } + + if (disable_nonboot_cpus()) { + set_abort_result(TOI_CPU_HOTPLUG_FAILED); + toi_end_atomic(ATOMIC_STEP_CPU_HOTPLUG, + suspend_time, 1); + return 1; + } + + local_irq_disable(); + + if (syscore_suspend()) { + set_abort_result(TOI_SYSCORE_REFUSED); + toi_end_atomic(ATOMIC_STEP_IRQS, suspend_time, 1); + return 1; + } + + if (suspend_time && pm_wakeup_pending()) { + set_abort_result(TOI_WAKEUP_EVENT); + toi_end_atomic(ATOMIC_STEP_SYSCORE_RESUME, suspend_time, 1); + return 1; + } + return 0; +} + +/** + * toi_end_atomic - post atomic copy/restore routines + * @stage: What step to start at. + * @suspend_time: Whether we're suspending or resuming. + * @error: Whether we're recovering from an error. + **/ +void toi_end_atomic(int stage, int suspend_time, int error) +{ + pm_message_t msg = suspend_time ? (error ? PMSG_RECOVER : PMSG_THAW) : + PMSG_RESTORE; + + switch (stage) { + case ATOMIC_ALL_STEPS: + if (!suspend_time) { + events_check_enabled = false; + } + platform_leave(1); + case ATOMIC_STEP_SYSCORE_RESUME: + syscore_resume(); + case ATOMIC_STEP_IRQS: + local_irq_enable(); + case ATOMIC_STEP_CPU_HOTPLUG: + enable_nonboot_cpus(); + case ATOMIC_STEP_PLATFORM_FINISH: + if (!suspend_time && error & 2) + platform_restore_cleanup(1); + else + platform_finish(1); + dpm_resume_start(msg); + case ATOMIC_STEP_DEVICE_RESUME: + if (suspend_time && (error & 2)) + platform_recover(1); + dpm_resume(msg); + if (!toi_in_suspend()) { + dpm_resume_end(PMSG_RECOVER); + } + if (error || !toi_in_suspend()) { + pm_restore_gfp_mask(); + } + resume_console(); + case ATOMIC_STEP_DPM_COMPLETE: + dpm_complete(msg); + case ATOMIC_STEP_PLATFORM_END: + platform_end(1); + + toi_prepare_status(DONT_CLEAR_BAR, "Post atomic."); + } +} diff --git b/kernel/power/tuxonice_atomic_copy.h b/kernel/power/tuxonice_atomic_copy.h new file mode 100644 index 0000000..e2d2b4f --- /dev/null +++ b/kernel/power/tuxonice_atomic_copy.h @@ -0,0 +1,25 @@ +/* + * kernel/power/tuxonice_atomic_copy.h + * + * Copyright 2008-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * Distributed under GPLv2. + * + * Routines for doing the atomic save/restore. + */ + +enum { + ATOMIC_ALL_STEPS, + ATOMIC_STEP_SYSCORE_RESUME, + ATOMIC_STEP_IRQS, + ATOMIC_STEP_CPU_HOTPLUG, + ATOMIC_STEP_PLATFORM_FINISH, + ATOMIC_STEP_DEVICE_RESUME, + ATOMIC_STEP_DPM_COMPLETE, + ATOMIC_STEP_PLATFORM_END, +}; + +int toi_go_atomic(pm_message_t state, int toi_time); +void toi_end_atomic(int stage, int toi_time, int error); + +extern void platform_recover(int platform_mode); diff --git b/kernel/power/tuxonice_bio.h b/kernel/power/tuxonice_bio.h new file mode 100644 index 0000000..9d52a3b --- /dev/null +++ b/kernel/power/tuxonice_bio.h @@ -0,0 +1,78 @@ +/* + * kernel/power/tuxonice_bio.h + * + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * Distributed under GPLv2. + * + * This file contains declarations for functions exported from + * tuxonice_bio.c, which contains low level io functions. + */ + +#include +#include "tuxonice_extent.h" + +void toi_put_extent_chain(struct hibernate_extent_chain *chain); +int toi_add_to_extent_chain(struct hibernate_extent_chain *chain, + unsigned long start, unsigned long end); + +struct hibernate_extent_saved_state { + int extent_num; + struct hibernate_extent *extent_ptr; + unsigned long offset; +}; + +struct toi_bdev_info { + struct toi_bdev_info *next; + struct hibernate_extent_chain blocks; + struct block_device *bdev; + struct toi_module_ops *allocator; + int allocator_index; + struct hibernate_extent_chain allocations; + char name[266]; /* "swap on " or "file " + up to 256 chars */ + + /* Saved in header */ + char uuid[17]; + dev_t dev_t; + int prio; + int bmap_shift; + int blocks_per_page; + unsigned long pages_used; + struct hibernate_extent_saved_state saved_state[4]; +}; + +struct toi_extent_iterate_state { + struct toi_bdev_info *current_chain; + int num_chains; + int saved_chain_number[4]; + struct toi_bdev_info *saved_chain_ptr[4]; +}; + +/* + * Our exported interface so the swapwriter and filewriter don't + * need these functions duplicated. + */ +struct toi_bio_ops { + int (*bdev_page_io) (int rw, struct block_device *bdev, long pos, + struct page *page); + int (*register_storage)(struct toi_bdev_info *new); + void (*free_storage)(void); +}; + +struct toi_allocator_ops { + unsigned long (*toi_swap_storage_available) (void); +}; + +extern struct toi_bio_ops toi_bio_ops; + +extern char *toi_writer_buffer; +extern int toi_writer_buffer_posn; + +struct toi_bio_allocator_ops { + int (*register_storage) (void); + unsigned long (*storage_available)(void); + int (*allocate_storage) (struct toi_bdev_info *, unsigned long); + int (*bmap) (struct toi_bdev_info *); + void (*free_storage) (struct toi_bdev_info *); + unsigned long (*free_unused_storage) (struct toi_bdev_info *, unsigned long used); +}; diff --git b/kernel/power/tuxonice_bio_chains.c b/kernel/power/tuxonice_bio_chains.c new file mode 100644 index 0000000..086a552 --- /dev/null +++ b/kernel/power/tuxonice_bio_chains.c @@ -0,0 +1,1126 @@ +/* + * kernel/power/tuxonice_bio_devinfo.c + * + * Copyright (C) 2009-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * Distributed under GPLv2. + * + */ + +#include +#include "tuxonice_bio.h" +#include "tuxonice_bio_internal.h" +#include "tuxonice_alloc.h" +#include "tuxonice_ui.h" +#include "tuxonice.h" +#include "tuxonice_io.h" + +static struct toi_bdev_info *prio_chain_head; +static int num_chains; + +/* Pointer to current entry being loaded/saved. */ +struct toi_extent_iterate_state toi_writer_posn; + +#define metadata_size (sizeof(struct toi_bdev_info) - \ + offsetof(struct toi_bdev_info, uuid)) + +/* + * After section 0 (header) comes 2 => next_section[0] = 2 + */ +static int next_section[3] = { 2, 3, 1 }; + +/** + * dump_block_chains - print the contents of the bdev info array. + **/ +void dump_block_chains(void) +{ + int i = 0; + int j; + struct toi_bdev_info *cur_chain = prio_chain_head; + + while (cur_chain) { + struct hibernate_extent *this = cur_chain->blocks.first; + + printk(KERN_DEBUG "Chain %d (prio %d):", i, cur_chain->prio); + + while (this) { + printk(KERN_CONT " [%lu-%lu]%s", this->start, + this->end, this->next ? "," : ""); + this = this->next; + } + + printk("\n"); + cur_chain = cur_chain->next; + i++; + } + + printk(KERN_DEBUG "Saved states:\n"); + for (i = 0; i < 4; i++) { + printk(KERN_DEBUG "Slot %d: Chain %d.\n", + i, toi_writer_posn.saved_chain_number[i]); + + cur_chain = prio_chain_head; + j = 0; + while (cur_chain) { + printk(KERN_DEBUG " Chain %d: Extent %d. Offset %lu.\n", + j, cur_chain->saved_state[i].extent_num, + cur_chain->saved_state[i].offset); + cur_chain = cur_chain->next; + j++; + } + printk(KERN_CONT "\n"); + } +} + +/** + * + **/ +static void toi_extent_chain_next(void) +{ + struct toi_bdev_info *this = toi_writer_posn.current_chain; + + if (!this->blocks.current_extent) + return; + + if (this->blocks.current_offset == this->blocks.current_extent->end) { + if (this->blocks.current_extent->next) { + this->blocks.current_extent = + this->blocks.current_extent->next; + this->blocks.current_offset = + this->blocks.current_extent->start; + } else { + this->blocks.current_extent = NULL; + this->blocks.current_offset = 0; + } + } else + this->blocks.current_offset++; +} + +/** + * + */ + +static struct toi_bdev_info *__find_next_chain_same_prio(void) +{ + struct toi_bdev_info *start_chain = toi_writer_posn.current_chain; + struct toi_bdev_info *this = start_chain; + int orig_prio = this->prio; + + do { + this = this->next; + + if (!this) + this = prio_chain_head; + + /* Back on original chain? Use it again. */ + if (this == start_chain) + return start_chain; + + } while (!this->blocks.current_extent || this->prio != orig_prio); + + return this; +} + +static void find_next_chain(void) +{ + struct toi_bdev_info *this; + + this = __find_next_chain_same_prio(); + + /* + * If we didn't get another chain of the same priority that we + * can use, look for the next priority. + */ + while (this && !this->blocks.current_extent) + this = this->next; + + toi_writer_posn.current_chain = this; +} + +/** + * toi_extent_state_next - go to the next extent + * @blocks: The number of values to progress. + * @stripe_mode: Whether to spread usage across all chains. + * + * Given a state, progress to the next valid entry. We may begin in an + * invalid state, as we do when invoked after extent_state_goto_start below. + * + * When using compression and expected_compression > 0, we let the image size + * be larger than storage, so we can validly run out of data to return. + **/ +static unsigned long toi_extent_state_next(int blocks, int current_stream) +{ + int i; + + if (!toi_writer_posn.current_chain) + return -ENOSPC; + + /* Assume chains always have lengths that are multiples of @blocks */ + for (i = 0; i < blocks; i++) + toi_extent_chain_next(); + + /* The header stream is not striped */ + if (current_stream || + !toi_writer_posn.current_chain->blocks.current_extent) + find_next_chain(); + + return toi_writer_posn.current_chain ? 0 : -ENOSPC; +} + +static void toi_insert_chain_in_prio_list(struct toi_bdev_info *this) +{ + struct toi_bdev_info **prev_ptr; + struct toi_bdev_info *cur; + + /* Loop through the existing chain, finding where to insert it */ + prev_ptr = &prio_chain_head; + cur = prio_chain_head; + + while (cur && cur->prio >= this->prio) { + prev_ptr = &cur->next; + cur = cur->next; + } + + this->next = *prev_ptr; + *prev_ptr = this; + + this = prio_chain_head; + while (this) + this = this->next; + num_chains++; +} + +/** + * toi_extent_state_goto_start - reinitialize an extent chain iterator + * @state: Iterator to reinitialize + **/ +void toi_extent_state_goto_start(void) +{ + struct toi_bdev_info *this = prio_chain_head; + + while (this) { + toi_message(TOI_BIO, TOI_VERBOSE, 0, + "Setting current extent to %p.", this->blocks.first); + this->blocks.current_extent = this->blocks.first; + if (this->blocks.current_extent) { + toi_message(TOI_BIO, TOI_VERBOSE, 0, + "Setting current offset to %lu.", + this->blocks.current_extent->start); + this->blocks.current_offset = + this->blocks.current_extent->start; + } + + this = this->next; + } + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Setting current chain to %p.", + prio_chain_head); + toi_writer_posn.current_chain = prio_chain_head; + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Leaving extent state goto start."); +} + +/** + * toi_extent_state_save - save state of the iterator + * @state: Current state of the chain + * @saved_state: Iterator to populate + * + * Given a state and a struct hibernate_extent_state_store, save the current + * position in a format that can be used with relocated chains (at + * resume time). + **/ +void toi_extent_state_save(int slot) +{ + struct toi_bdev_info *cur_chain = prio_chain_head; + struct hibernate_extent *extent; + struct hibernate_extent_saved_state *chain_state; + int i = 0; + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "toi_extent_state_save, slot %d.", + slot); + + if (!toi_writer_posn.current_chain) { + toi_message(TOI_BIO, TOI_VERBOSE, 0, "No current chain => " + "chain_num = -1."); + toi_writer_posn.saved_chain_number[slot] = -1; + return; + } + + while (cur_chain) { + i++; + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Saving chain %d (%p) " + "state, slot %d.", i, cur_chain, slot); + + chain_state = &cur_chain->saved_state[slot]; + + chain_state->offset = cur_chain->blocks.current_offset; + + if (toi_writer_posn.current_chain == cur_chain) { + toi_writer_posn.saved_chain_number[slot] = i; + toi_message(TOI_BIO, TOI_VERBOSE, 0, "This is the chain " + "we were on => chain_num is %d.", i); + } + + if (!cur_chain->blocks.current_extent) { + chain_state->extent_num = 0; + toi_message(TOI_BIO, TOI_VERBOSE, 0, "No current extent " + "for this chain => extent_num %d is 0.", + i); + cur_chain = cur_chain->next; + continue; + } + + extent = cur_chain->blocks.first; + chain_state->extent_num = 1; + + while (extent != cur_chain->blocks.current_extent) { + chain_state->extent_num++; + extent = extent->next; + } + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "extent num %d is %d.", i, + chain_state->extent_num); + + cur_chain = cur_chain->next; + } + toi_message(TOI_BIO, TOI_VERBOSE, 0, + "Completed saving extent state slot %d.", slot); +} + +/** + * toi_extent_state_restore - restore the position saved by extent_state_save + * @state: State to populate + * @saved_state: Iterator saved to restore + **/ +void toi_extent_state_restore(int slot) +{ + int i = 0; + struct toi_bdev_info *cur_chain = prio_chain_head; + struct hibernate_extent_saved_state *chain_state; + + toi_message(TOI_BIO, TOI_VERBOSE, 0, + "toi_extent_state_restore - slot %d.", slot); + + if (toi_writer_posn.saved_chain_number[slot] == -1) { + toi_writer_posn.current_chain = NULL; + return; + } + + while (cur_chain) { + int posn; + int j; + i++; + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Restoring chain %d (%p) " + "state, slot %d.", i, cur_chain, slot); + + chain_state = &cur_chain->saved_state[slot]; + + posn = chain_state->extent_num; + + cur_chain->blocks.current_extent = cur_chain->blocks.first; + cur_chain->blocks.current_offset = chain_state->offset; + + if (i == toi_writer_posn.saved_chain_number[slot]) { + toi_writer_posn.current_chain = cur_chain; + toi_message(TOI_BIO, TOI_VERBOSE, 0, + "Found current chain."); + } + + for (j = 0; j < 4; j++) + if (i == toi_writer_posn.saved_chain_number[j]) { + toi_writer_posn.saved_chain_ptr[j] = cur_chain; + toi_message(TOI_BIO, TOI_VERBOSE, 0, + "Found saved chain ptr %d (%p) (offset" + " %d).", j, cur_chain, + cur_chain->saved_state[j].offset); + } + + if (posn) { + while (--posn) + cur_chain->blocks.current_extent = + cur_chain->blocks.current_extent->next; + } else + cur_chain->blocks.current_extent = NULL; + + cur_chain = cur_chain->next; + } + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Done."); + if (test_action_state(TOI_LOGALL)) + dump_block_chains(); +} + +/* + * Storage needed + * + * Returns amount of space in the image header required + * for the chain data. This ignores the links between + * pages, which we factor in when allocating the space. + */ +int toi_bio_devinfo_storage_needed(void) +{ + int result = sizeof(num_chains); + struct toi_bdev_info *chain = prio_chain_head; + + while (chain) { + result += metadata_size; + + /* Chain size */ + result += sizeof(int); + + /* Extents */ + result += (2 * sizeof(unsigned long) * + chain->blocks.num_extents); + + chain = chain->next; + } + + result += 4 * sizeof(int); + return result; +} + +static unsigned long chain_pages_used(struct toi_bdev_info *chain) +{ + struct hibernate_extent *this = chain->blocks.first; + struct hibernate_extent_saved_state *state = &chain->saved_state[3]; + unsigned long size = 0; + int extent_idx = 1; + + if (!state->extent_num) { + if (!this) + return 0; + else + return chain->blocks.size; + } + + while (extent_idx < state->extent_num) { + size += (this->end - this->start + 1); + this = this->next; + extent_idx++; + } + + /* We didn't use the one we're sitting on, so don't count it */ + return size + state->offset - this->start; +} + +void toi_bio_free_unused_storage_chain(struct toi_bdev_info *chain) +{ + unsigned long used = chain_pages_used(chain); + + /* Free the storage */ + unsigned long first_freed = 0; + + if (chain->allocator->bio_allocator_ops->free_unused_storage) + first_freed = chain->allocator->bio_allocator_ops->free_unused_storage(chain, used); + + printk(KERN_EMERG "Used %ld blocks in this chain. First extent freed is %lx.\n", used, first_freed); + + /* Adjust / free the extents. */ + toi_put_extent_chain_from(&chain->blocks, first_freed); + + { + struct hibernate_extent *this = chain->blocks.first; + while (this) { + printk("Extent %lx-%lx.\n", this->start, this->end); + this = this->next; + } + } +} + +/** + * toi_serialise_extent_chain - write a chain in the image + * @chain: Chain to write. + **/ +static int toi_serialise_extent_chain(struct toi_bdev_info *chain) +{ + struct hibernate_extent *this; + int ret; + int i = 1; + + chain->pages_used = chain_pages_used(chain); + + if (test_action_state(TOI_LOGALL)) + dump_block_chains(); + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Serialising chain (dev_t %lx).", + chain->dev_t); + /* Device info - dev_t, prio, bmap_shift, blocks per page, positions */ + ret = toiActiveAllocator->rw_header_chunk(WRITE, &toi_blockwriter_ops, + (char *) &chain->uuid, metadata_size); + if (ret) + return ret; + + /* Num extents */ + ret = toiActiveAllocator->rw_header_chunk(WRITE, &toi_blockwriter_ops, + (char *) &chain->blocks.num_extents, sizeof(int)); + if (ret) + return ret; + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "%d extents.", + chain->blocks.num_extents); + + this = chain->blocks.first; + while (this) { + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Extent %d.", i); + ret = toiActiveAllocator->rw_header_chunk(WRITE, + &toi_blockwriter_ops, + (char *) this, 2 * sizeof(this->start)); + if (ret) + return ret; + this = this->next; + i++; + } + + return ret; +} + +int toi_serialise_extent_chains(void) +{ + struct toi_bdev_info *this = prio_chain_head; + int result; + + /* Write the number of chains */ + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Write number of chains (%d)", + num_chains); + result = toiActiveAllocator->rw_header_chunk(WRITE, + &toi_blockwriter_ops, (char *) &num_chains, + sizeof(int)); + if (result) + return result; + + /* Then the chains themselves */ + while (this) { + result = toi_serialise_extent_chain(this); + if (result) + return result; + this = this->next; + } + + /* + * Finally, the chain we should be on at the start of each + * section. + */ + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Saved chain numbers."); + result = toiActiveAllocator->rw_header_chunk(WRITE, + &toi_blockwriter_ops, + (char *) &toi_writer_posn.saved_chain_number[0], + 4 * sizeof(int)); + + return result; +} + +int toi_register_storage_chain(struct toi_bdev_info *new) +{ + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Inserting chain %p into list.", + new); + toi_insert_chain_in_prio_list(new); + return 0; +} + +static void free_bdev_info(struct toi_bdev_info *chain) +{ + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Free chain %p.", chain); + + toi_message(TOI_BIO, TOI_VERBOSE, 0, " - Block extents."); + toi_put_extent_chain(&chain->blocks); + + /* + * The allocator may need to do more than just free the chains + * (swap_free, for example). Don't call from boot kernel. + */ + toi_message(TOI_BIO, TOI_VERBOSE, 0, " - Allocator extents."); + if (chain->allocator) + chain->allocator->bio_allocator_ops->free_storage(chain); + + /* + * Dropping out of reading atomic copy? Need to undo + * toi_open_by_devnum. + */ + toi_message(TOI_BIO, TOI_VERBOSE, 0, " - Bdev."); + if (chain->bdev && !IS_ERR(chain->bdev) && + chain->bdev != resume_block_device && + chain->bdev != header_block_device && + test_toi_state(TOI_TRYING_TO_RESUME)) + toi_close_bdev(chain->bdev); + + /* Poison */ + toi_message(TOI_BIO, TOI_VERBOSE, 0, " - Struct."); + toi_kfree(39, chain, sizeof(*chain)); + + if (prio_chain_head == chain) + prio_chain_head = NULL; + + num_chains--; +} + +void free_all_bdev_info(void) +{ + struct toi_bdev_info *this = prio_chain_head; + + while (this) { + struct toi_bdev_info *next = this->next; + free_bdev_info(this); + this = next; + } + + memset((char *) &toi_writer_posn, 0, sizeof(toi_writer_posn)); + prio_chain_head = NULL; +} + +static void set_up_start_position(void) +{ + toi_writer_posn.current_chain = prio_chain_head; + go_next_page(0, 0); +} + +/** + * toi_load_extent_chain - read back a chain saved in the image + * @chain: Chain to load + * + * The linked list of extents is reconstructed from the disk. chain will point + * to the first entry. + **/ +int toi_load_extent_chain(int index, int *num_loaded) +{ + struct toi_bdev_info *chain = toi_kzalloc(39, + sizeof(struct toi_bdev_info), GFP_ATOMIC); + struct hibernate_extent *this, *last = NULL; + int i, ret; + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Loading extent chain %d.", index); + /* Get dev_t, prio, bmap_shift, blocks per page, positions */ + ret = toiActiveAllocator->rw_header_chunk_noreadahead(READ, NULL, + (char *) &chain->uuid, metadata_size); + + if (ret) { + printk(KERN_ERR "Failed to read the size of extent chain.\n"); + toi_kfree(39, chain, sizeof(*chain)); + return 1; + } + + toi_bkd.pages_used[index] = chain->pages_used; + + ret = toiActiveAllocator->rw_header_chunk_noreadahead(READ, NULL, + (char *) &chain->blocks.num_extents, sizeof(int)); + if (ret) { + printk(KERN_ERR "Failed to read the size of extent chain.\n"); + toi_kfree(39, chain, sizeof(*chain)); + return 1; + } + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "%d extents.", + chain->blocks.num_extents); + + for (i = 0; i < chain->blocks.num_extents; i++) { + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Extent %d.", i + 1); + + this = toi_kzalloc(2, sizeof(struct hibernate_extent), + TOI_ATOMIC_GFP); + if (!this) { + printk(KERN_INFO "Failed to allocate a new extent.\n"); + free_bdev_info(chain); + return -ENOMEM; + } + this->next = NULL; + /* Get the next page */ + ret = toiActiveAllocator->rw_header_chunk_noreadahead(READ, + NULL, (char *) this, 2 * sizeof(this->start)); + if (ret) { + printk(KERN_INFO "Failed to read an extent.\n"); + toi_kfree(2, this, sizeof(struct hibernate_extent)); + free_bdev_info(chain); + return 1; + } + + if (last) + last->next = this; + else { + char b1[32], b2[32], b3[32]; + /* + * Open the bdev + */ + toi_message(TOI_BIO, TOI_VERBOSE, 0, + "Chain dev_t is %s. Resume dev t is %s. Header" + " bdev_t is %s.\n", + format_dev_t(b1, chain->dev_t), + format_dev_t(b2, resume_dev_t), + format_dev_t(b3, toi_sig_data->header_dev_t)); + + if (chain->dev_t == resume_dev_t) + chain->bdev = resume_block_device; + else if (chain->dev_t == toi_sig_data->header_dev_t) + chain->bdev = header_block_device; + else { + chain->bdev = toi_open_bdev(chain->uuid, + chain->dev_t, 1); + if (IS_ERR(chain->bdev)) { + free_bdev_info(chain); + return -ENODEV; + } + } + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Chain bmap shift " + "is %d and blocks per page is %d.", + chain->bmap_shift, + chain->blocks_per_page); + + chain->blocks.first = this; + + /* + * Couldn't do this earlier, but can't do + * goto_start now - we may have already used blocks + * in the first chain. + */ + chain->blocks.current_extent = this; + chain->blocks.current_offset = this->start; + + /* + * Can't wait until we've read the whole chain + * before we insert it in the list. We might need + * this chain to read the next page in the header + */ + toi_insert_chain_in_prio_list(chain); + } + + /* + * We have to wait until 2 extents are loaded before setting up + * properly because if the first extent has only one page, we + * will need to put the position on the second extent. Sounds + * obvious, but it wasn't! + */ + (*num_loaded)++; + if ((*num_loaded) == 2) + set_up_start_position(); + last = this; + } + + /* + * Shouldn't get empty chains, but it's not impossible. Link them in so + * they get freed properly later. + */ + if (!chain->blocks.num_extents) + toi_insert_chain_in_prio_list(chain); + + if (!chain->blocks.current_extent) { + chain->blocks.current_extent = chain->blocks.first; + if (chain->blocks.current_extent) + chain->blocks.current_offset = + chain->blocks.current_extent->start; + } + return 0; +} + +int toi_load_extent_chains(void) +{ + int result; + int to_load; + int i; + int extents_loaded = 0; + + result = toiActiveAllocator->rw_header_chunk_noreadahead(READ, NULL, + (char *) &to_load, + sizeof(int)); + if (result) + return result; + toi_message(TOI_BIO, TOI_VERBOSE, 0, "%d chains to read.", to_load); + + for (i = 0; i < to_load; i++) { + toi_message(TOI_BIO, TOI_VERBOSE, 0, " >> Loading chain %d/%d.", + i, to_load); + result = toi_load_extent_chain(i, &extents_loaded); + if (result) + return result; + } + + /* If we never got to a second extent, we still need to do this. */ + if (extents_loaded == 1) + set_up_start_position(); + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Save chain numbers."); + result = toiActiveAllocator->rw_header_chunk_noreadahead(READ, + &toi_blockwriter_ops, + (char *) &toi_writer_posn.saved_chain_number[0], + 4 * sizeof(int)); + + return result; +} + +static int toi_end_of_stream(int writing, int section_barrier) +{ + struct toi_bdev_info *cur_chain = toi_writer_posn.current_chain; + int compare_to = next_section[current_stream]; + struct toi_bdev_info *compare_chain = + toi_writer_posn.saved_chain_ptr[compare_to]; + int compare_offset = compare_chain ? + compare_chain->saved_state[compare_to].offset : 0; + + if (!section_barrier) + return 0; + + if (!cur_chain) + return 1; + + if (cur_chain == compare_chain && + cur_chain->blocks.current_offset == compare_offset) { + if (writing) { + if (!current_stream) { + debug_broken_header(); + return 1; + } + } else { + more_readahead = 0; + toi_message(TOI_BIO, TOI_VERBOSE, 0, + "Reached the end of stream %d " + "(not an error).", current_stream); + return 1; + } + } + + return 0; +} + +/** + * go_next_page - skip blocks to the start of the next page + * @writing: Whether we're reading or writing the image. + * + * Go forward one page. + **/ +int go_next_page(int writing, int section_barrier) +{ + struct toi_bdev_info *cur_chain = toi_writer_posn.current_chain; + int max = cur_chain ? cur_chain->blocks_per_page : 1; + + /* Nope. Go foward a page - or maybe two. Don't stripe the header, + * so that bad fragmentation doesn't put the extent data containing + * the location of the second page out of the first header page. + */ + if (toi_extent_state_next(max, current_stream)) { + /* Don't complain if readahead falls off the end */ + if (writing && section_barrier) { + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Extent state eof. " + "Expected compression ratio too optimistic?"); + if (test_action_state(TOI_LOGALL)) + dump_block_chains(); + } + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Ran out of extents to " + "read/write. (Not necessarily a fatal error."); + return -ENOSPC; + } + + return 0; +} + +int devices_of_same_priority(struct toi_bdev_info *this) +{ + struct toi_bdev_info *check = prio_chain_head; + int i = 0; + + while (check) { + if (check->prio == this->prio) + i++; + check = check->next; + } + + return i; +} + +/** + * toi_bio_rw_page - do i/o on the next disk page in the image + * @writing: Whether reading or writing. + * @page: Page to do i/o on. + * @is_readahead: Whether we're doing readahead + * @free_group: The group used in allocating the page + * + * Submit a page for reading or writing, possibly readahead. + * Pass the group used in allocating the page as well, as it should + * be freed on completion of the bio if we're writing the page. + **/ +int toi_bio_rw_page(int writing, struct page *page, + int is_readahead, int free_group) +{ + int result = toi_end_of_stream(writing, 1); + struct toi_bdev_info *dev_info = toi_writer_posn.current_chain; + + if (result) { + if (writing) + abort_hibernate(TOI_INSUFFICIENT_STORAGE, + "Insufficient storage for your image."); + else + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Seeking to " + "read/write another page when stream has " + "ended."); + return -ENOSPC; + } + + toi_message(TOI_BIO, TOI_VERBOSE, 0, + "%s %lx:%ld", + writing ? "Write" : "Read", + dev_info->dev_t, dev_info->blocks.current_offset); + + result = toi_do_io(writing, dev_info->bdev, + dev_info->blocks.current_offset << dev_info->bmap_shift, + page, is_readahead, 0, free_group); + + /* Ignore the result here - will check end of stream if come in again */ + go_next_page(writing, 1); + + if (result) + printk(KERN_ERR "toi_do_io returned %d.\n", result); + return result; +} + +dev_t get_header_dev_t(void) +{ + return prio_chain_head->dev_t; +} + +struct block_device *get_header_bdev(void) +{ + return prio_chain_head->bdev; +} + +unsigned long get_headerblock(void) +{ + return prio_chain_head->blocks.first->start << + prio_chain_head->bmap_shift; +} + +int get_main_pool_phys_params(void) +{ + struct toi_bdev_info *this = prio_chain_head; + int result; + + while (this) { + result = this->allocator->bio_allocator_ops->bmap(this); + if (result) + return result; + this = this->next; + } + + return 0; +} + +static int apply_header_reservation(void) +{ + int i; + + if (!header_pages_reserved) { + toi_message(TOI_BIO, TOI_VERBOSE, 0, + "No header pages reserved at the moment."); + return 0; + } + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Applying header reservation."); + + /* Apply header space reservation */ + toi_extent_state_goto_start(); + + for (i = 0; i < header_pages_reserved; i++) + if (go_next_page(1, 0)) + return -ENOSPC; + + /* The end of header pages will be the start of pageset 2 */ + toi_extent_state_save(2); + + toi_message(TOI_BIO, TOI_VERBOSE, 0, + "Finished applying header reservation."); + return 0; +} + +static int toi_bio_register_storage(void) +{ + int result = 0; + struct toi_module_ops *this_module; + + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled || + this_module->type != BIO_ALLOCATOR_MODULE) + continue; + toi_message(TOI_BIO, TOI_VERBOSE, 0, + "Registering storage from %s.", + this_module->name); + result = this_module->bio_allocator_ops->register_storage(); + if (result) + break; + } + + return result; +} + +void toi_bio_free_unused_storage(void) +{ + struct toi_bdev_info *this = prio_chain_head; + + while (this) { + toi_bio_free_unused_storage_chain(this); + this = this->next; + } +} + +int toi_bio_allocate_storage(unsigned long request) +{ + struct toi_bdev_info *chain = prio_chain_head; + unsigned long to_get = request; + unsigned long extra_pages, needed; + int no_free = 0; + + if (!chain) { + int result = toi_bio_register_storage(); + toi_message(TOI_BIO, TOI_VERBOSE, 0, "toi_bio_allocate_storage: " + "Registering storage."); + if (result) + return 0; + chain = prio_chain_head; + if (!chain) { + printk("TuxOnIce: No storage was registered.\n"); + return 0; + } + } + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "toi_bio_allocate_storage: " + "Request is %lu pages.", request); + extra_pages = DIV_ROUND_UP(request * (sizeof(unsigned long) + + sizeof(int)), PAGE_SIZE); + needed = request + extra_pages + header_pages_reserved; + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Adding %lu extra pages and %lu " + "for header => %lu.", + extra_pages, header_pages_reserved, needed); + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Already allocated %lu pages.", + raw_pages_allocd); + + to_get = needed > raw_pages_allocd ? needed - raw_pages_allocd : 0; + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Need to get %lu pages.", to_get); + + if (!to_get) + return apply_header_reservation(); + + while (to_get && chain) { + int num_group = devices_of_same_priority(chain); + int divisor = num_group - no_free; + int i; + unsigned long portion = DIV_ROUND_UP(to_get, divisor); + unsigned long got = 0; + unsigned long got_this_round = 0; + struct toi_bdev_info *top = chain; + + toi_message(TOI_BIO, TOI_VERBOSE, 0, + " Start of loop. To get is %lu. Divisor is %d.", + to_get, divisor); + no_free = 0; + + /* + * We're aiming to spread the allocated storage as evenly + * as possible, but we also want to get all the storage we + * can off this priority. + */ + for (i = 0; i < num_group; i++) { + struct toi_bio_allocator_ops *ops = + chain->allocator->bio_allocator_ops; + toi_message(TOI_BIO, TOI_VERBOSE, 0, + " Asking for %lu pages from chain %p.", + portion, chain); + got = ops->allocate_storage(chain, portion); + toi_message(TOI_BIO, TOI_VERBOSE, 0, + " Got %lu pages from allocator %p.", + got, chain); + if (!got) + no_free++; + got_this_round += got; + chain = chain->next; + } + toi_message(TOI_BIO, TOI_VERBOSE, 0, " Loop finished. Got a " + "total of %lu pages from %d allocators.", + got_this_round, divisor - no_free); + + raw_pages_allocd += got_this_round; + to_get = needed > raw_pages_allocd ? needed - raw_pages_allocd : + 0; + + /* + * If we got anything from chains of this priority and we + * still have storage to allocate, go over this priority + * again. + */ + if (got_this_round && to_get) + chain = top; + else + no_free = 0; + } + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Finished allocating. Calling " + "get_main_pool_phys_params"); + /* Now let swap allocator bmap the pages */ + get_main_pool_phys_params(); + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Done. Reserving header."); + return apply_header_reservation(); +} + +void toi_bio_chains_post_atomic(struct toi_boot_kernel_data *bkd) +{ + int i = 0; + struct toi_bdev_info *cur_chain = prio_chain_head; + + while (cur_chain) { + cur_chain->pages_used = bkd->pages_used[i]; + cur_chain = cur_chain->next; + i++; + } +} + +int toi_bio_chains_debug_info(char *buffer, int size) +{ + /* Show what we actually used */ + struct toi_bdev_info *cur_chain = prio_chain_head; + int len = 0; + + while (cur_chain) { + len += scnprintf(buffer + len, size - len, " Used %lu pages " + "from %s.\n", cur_chain->pages_used, + cur_chain->name); + cur_chain = cur_chain->next; + } + + return len; +} + +void toi_bio_store_inc_image_ptr(struct toi_incremental_image_pointer *ptr) +{ + struct toi_bdev_info *this = toi_writer_posn.current_chain, + *cmp = prio_chain_head; + + ptr->save.chain = 1; + while (this != cmp) { + ptr->save.chain++; + cmp = cmp->next; + } + ptr->save.block = this->blocks.current_offset; + + /* Save the raw info internally for quicker access when updating pointers */ + ptr->bdev = this->bdev; + ptr->block = this->blocks.current_offset << this->bmap_shift; +} + +void toi_bio_restore_inc_image_ptr(struct toi_incremental_image_pointer *ptr) +{ + int i = ptr->save.chain - 1; + struct toi_bdev_info *this; + struct hibernate_extent *hib; + + /* Find chain by stored index */ + this = prio_chain_head; + while (i) { + this = this->next; + i--; + } + toi_writer_posn.current_chain = this; + + /* Restore block */ + this->blocks.current_offset = ptr->save.block; + + /* Find current offset from block number */ + hib = this->blocks.first; + + while (hib->start > ptr->save.block) { + hib = hib->next; + } + + this->blocks.last_touched = this->blocks.current_extent = hib; +} diff --git b/kernel/power/tuxonice_bio_core.c b/kernel/power/tuxonice_bio_core.c new file mode 100644 index 0000000..87aa4c9 --- /dev/null +++ b/kernel/power/tuxonice_bio_core.c @@ -0,0 +1,1932 @@ +/* + * kernel/power/tuxonice_bio.c + * + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * Distributed under GPLv2. + * + * This file contains block io functions for TuxOnIce. These are + * used by the swapwriter and it is planned that they will also + * be used by the NFSwriter. + * + */ + +#include +#include +#include +#include +#include +#include + +#include "tuxonice.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_modules.h" +#include "tuxonice_prepare_image.h" +#include "tuxonice_bio.h" +#include "tuxonice_ui.h" +#include "tuxonice_alloc.h" +#include "tuxonice_io.h" +#include "tuxonice_builtin.h" +#include "tuxonice_bio_internal.h" + +#define MEMORY_ONLY 1 +#define THROTTLE_WAIT 2 + +/* #define MEASURE_MUTEX_CONTENTION */ +#ifndef MEASURE_MUTEX_CONTENTION +#define my_mutex_lock(index, the_lock) mutex_lock(the_lock) +#define my_mutex_unlock(index, the_lock) mutex_unlock(the_lock) +#else +unsigned long mutex_times[2][2][NR_CPUS]; +#define my_mutex_lock(index, the_lock) do { \ + int have_mutex; \ + have_mutex = mutex_trylock(the_lock); \ + if (!have_mutex) { \ + mutex_lock(the_lock); \ + mutex_times[index][0][smp_processor_id()]++; \ + } else { \ + mutex_times[index][1][smp_processor_id()]++; \ + } + +#define my_mutex_unlock(index, the_lock) \ + mutex_unlock(the_lock); \ +} while (0) +#endif + +static int page_idx, reset_idx; + +static int target_outstanding_io = 1024; +static int max_outstanding_writes, max_outstanding_reads; + +static struct page *bio_queue_head, *bio_queue_tail; +static atomic_t toi_bio_queue_size; +static DEFINE_SPINLOCK(bio_queue_lock); + +static int free_mem_throttle, throughput_throttle; +int more_readahead = 1; +static struct page *readahead_list_head, *readahead_list_tail; + +static struct page *waiting_on; + +static atomic_t toi_io_in_progress, toi_io_done; +static DECLARE_WAIT_QUEUE_HEAD(num_in_progress_wait); + +int current_stream; +/* Not static, so that the allocators can setup and complete + * writing the header */ +char *toi_writer_buffer; +int toi_writer_buffer_posn; + +static DEFINE_MUTEX(toi_bio_mutex); +static DEFINE_MUTEX(toi_bio_readahead_mutex); + +static struct task_struct *toi_queue_flusher; +static int toi_bio_queue_flush_pages(int dedicated_thread); + +struct toi_module_ops toi_blockwriter_ops; + +struct toi_incremental_image_pointer toi_inc_ptr[2][2]; + +#define TOTAL_OUTSTANDING_IO (atomic_read(&toi_io_in_progress) + \ + atomic_read(&toi_bio_queue_size)) + +unsigned long raw_pages_allocd, header_pages_reserved; + +static int toi_rw_buffer(int writing, char *buffer, int buffer_size, + int no_readahead); + +/** + * set_free_mem_throttle - set the point where we pause to avoid oom. + * + * Initially, this value is zero, but when we first fail to allocate memory, + * we set it (plus a buffer) and thereafter throttle i/o once that limit is + * reached. + **/ +static void set_free_mem_throttle(void) +{ + int new_throttle = nr_free_buffer_pages() + 256; + + if (new_throttle > free_mem_throttle) + free_mem_throttle = new_throttle; +} + +#define NUM_REASONS 7 +static atomic_t reasons[NUM_REASONS]; +static char *reason_name[NUM_REASONS] = { + "readahead not ready", + "bio allocation", + "synchronous I/O", + "toi_bio_get_new_page", + "memory low", + "readahead buffer allocation", + "throughput_throttle", +}; + +/* User Specified Parameters. */ +unsigned long resume_firstblock; +dev_t resume_dev_t; +struct block_device *resume_block_device; +static atomic_t resume_bdev_open_count; + +struct block_device *header_block_device; + +/** + * toi_open_bdev: Open a bdev at resume time. + * + * index: The swap index. May be MAX_SWAPFILES for the resume_dev_t + * (the user can have resume= pointing at a swap partition/file that isn't + * swapon'd when they hibernate. MAX_SWAPFILES+1 for the first page of the + * header. It will be from a swap partition that was enabled when we hibernated, + * but we don't know it's real index until we read that first page. + * dev_t: The device major/minor. + * display_errs: Whether to try to do this quietly. + * + * We stored a dev_t in the image header. Open the matching device without + * requiring /dev/ in most cases and record the details needed + * to close it later and avoid duplicating work. + */ +struct block_device *toi_open_bdev(char *uuid, dev_t default_device, + int display_errs) +{ + struct block_device *bdev; + dev_t device = default_device; + char buf[32]; + int retried = 0; + +retry: + if (uuid) { + struct fs_info seek; + strncpy((char *) &seek.uuid, uuid, 16); + seek.dev_t = 0; + seek.last_mount_size = 0; + device = blk_lookup_fs_info(&seek); + if (!device) { + device = default_device; + printk(KERN_DEBUG "Unable to resolve uuid. Falling back" + " to dev_t.\n"); + } else + printk(KERN_DEBUG "Resolved uuid to device %s.\n", + format_dev_t(buf, device)); + } + + if (!device) { + printk(KERN_ERR "TuxOnIce attempting to open a " + "blank dev_t!\n"); + dump_stack(); + return NULL; + } + bdev = toi_open_by_devnum(device); + + if (IS_ERR(bdev) || !bdev) { + if (!retried) { + retried = 1; + wait_for_device_probe(); + goto retry; + } + if (display_errs) + toi_early_boot_message(1, TOI_CONTINUE_REQ, + "Failed to get access to block device " + "\"%x\" (error %d).\n Maybe you need " + "to run mknod and/or lvmsetup in an " + "initrd/ramfs?", device, bdev); + return ERR_PTR(-EINVAL); + } + toi_message(TOI_BIO, TOI_VERBOSE, 0, + "TuxOnIce got bdev %p for dev_t %x.", + bdev, device); + + return bdev; +} + +static void toi_bio_reserve_header_space(unsigned long request) +{ + header_pages_reserved = request; +} + +/** + * do_bio_wait - wait for some TuxOnIce I/O to complete + * @reason: The array index of the reason we're waiting. + * + * Wait for a particular page of I/O if we're after a particular page. + * If we're not after a particular page, wait instead for all in flight + * I/O to be completed or for us to have enough free memory to be able + * to submit more I/O. + * + * If we wait, we also update our statistics regarding why we waited. + **/ +static void do_bio_wait(int reason) +{ + struct page *was_waiting_on = waiting_on; + + /* On SMP, waiting_on can be reset, so we make a copy */ + if (was_waiting_on) { + wait_on_page_locked(was_waiting_on); + atomic_inc(&reasons[reason]); + } else { + atomic_inc(&reasons[reason]); + + wait_event(num_in_progress_wait, + !atomic_read(&toi_io_in_progress) || + nr_free_buffer_pages() > free_mem_throttle); + } +} + +/** + * throttle_if_needed - wait for I/O completion if throttle points are reached + * @flags: What to check and how to act. + * + * Check whether we need to wait for some I/O to complete. We always check + * whether we have enough memory available, but may also (depending upon + * @reason) check if the throughput throttle limit has been reached. + **/ +static int throttle_if_needed(int flags) +{ + int free_pages = nr_free_buffer_pages(); + + /* Getting low on memory and I/O is in progress? */ + while (unlikely(free_pages < free_mem_throttle) && + atomic_read(&toi_io_in_progress) && + !test_result_state(TOI_ABORTED)) { + if (!(flags & THROTTLE_WAIT)) + return -ENOMEM; + do_bio_wait(4); + free_pages = nr_free_buffer_pages(); + } + + while (!(flags & MEMORY_ONLY) && throughput_throttle && + TOTAL_OUTSTANDING_IO >= throughput_throttle && + !test_result_state(TOI_ABORTED)) { + int result = toi_bio_queue_flush_pages(0); + if (result) + return result; + atomic_inc(&reasons[6]); + wait_event(num_in_progress_wait, + !atomic_read(&toi_io_in_progress) || + TOTAL_OUTSTANDING_IO < throughput_throttle); + } + + return 0; +} + +/** + * update_throughput_throttle - update the raw throughput throttle + * @jif_index: The number of times this function has been called. + * + * This function is called four times per second by the core, and used to limit + * the amount of I/O we submit at once, spreading out our waiting through the + * whole job and letting userui get an opportunity to do its work. + * + * We don't start limiting I/O until 1/4s has gone so that we get a + * decent sample for our initial limit, and keep updating it because + * throughput may vary (on rotating media, eg) with our block number. + * + * We throttle to 1/10s worth of I/O. + **/ +static void update_throughput_throttle(int jif_index) +{ + int done = atomic_read(&toi_io_done); + throughput_throttle = done * 2 / 5 / jif_index; +} + +/** + * toi_finish_all_io - wait for all outstanding i/o to complete + * + * Flush any queued but unsubmitted I/O and wait for it all to complete. + **/ +static int toi_finish_all_io(void) +{ + int result = toi_bio_queue_flush_pages(0); + toi_bio_queue_flusher_should_finish = 1; + wake_up(&toi_io_queue_flusher); + wait_event(num_in_progress_wait, !TOTAL_OUTSTANDING_IO); + return result; +} + +/** + * toi_end_bio - bio completion function. + * @bio: bio that has completed. + * + * Function called by the block driver from interrupt context when I/O is + * completed. If we were writing the page, we want to free it and will have + * set bio->bi_private to the parameter we should use in telling the page + * allocation accounting code what the page was allocated for. If we're + * reading the page, it will be in the singly linked list made from + * page->private pointers. + **/ +static void toi_end_bio(struct bio *bio) +{ + struct page *page = bio->bi_io_vec[0].bv_page; + + BUG_ON(bio->bi_error); + + unlock_page(page); + bio_put(bio); + + if (waiting_on == page) + waiting_on = NULL; + + put_page(page); + + if (bio->bi_private) + toi__free_page((int) ((unsigned long) bio->bi_private) , page); + + bio_put(bio); + + atomic_dec(&toi_io_in_progress); + atomic_inc(&toi_io_done); + + wake_up(&num_in_progress_wait); +} + +/** + * submit - submit BIO request + * @writing: READ or WRITE. + * @dev: The block device we're using. + * @first_block: The first sector we're using. + * @page: The page being used for I/O. + * @free_group: If writing, the group that was used in allocating the page + * and which will be used in freeing the page from the completion + * routine. + * + * Based on Patrick Mochell's pmdisk code from long ago: "Straight from the + * textbook - allocate and initialize the bio. If we're writing, make sure + * the page is marked as dirty. Then submit it and carry on." + * + * If we're just testing the speed of our own code, we fake having done all + * the hard work and all toi_end_bio immediately. + **/ +static int submit(int writing, struct block_device *dev, sector_t first_block, + struct page *page, int free_group) +{ + struct bio *bio = NULL; + int cur_outstanding_io, result; + + /* + * Shouldn't throttle if reading - can deadlock in the single + * threaded case as pages are only freed when we use the + * readahead. + */ + if (writing) { + result = throttle_if_needed(MEMORY_ONLY | THROTTLE_WAIT); + if (result) + return result; + } + + while (!bio) { + bio = bio_alloc(TOI_ATOMIC_GFP, 1); + if (!bio) { + set_free_mem_throttle(); + do_bio_wait(1); + } + } + + bio->bi_bdev = dev; + bio->bi_iter.bi_sector = first_block; + bio->bi_private = (void *) ((unsigned long) free_group); + bio->bi_end_io = toi_end_bio; + bio_set_flag(bio, BIO_TOI); + + if (bio_add_page(bio, page, PAGE_SIZE, 0) < PAGE_SIZE) { + printk(KERN_DEBUG "ERROR: adding page to bio at %lld\n", + (unsigned long long) first_block); + bio_put(bio); + return -EFAULT; + } + + bio_get(bio); + + cur_outstanding_io = atomic_add_return(1, &toi_io_in_progress); + if (writing) { + if (cur_outstanding_io > max_outstanding_writes) + max_outstanding_writes = cur_outstanding_io; + } else { + if (cur_outstanding_io > max_outstanding_reads) + max_outstanding_reads = cur_outstanding_io; + } + + /* Still read the header! */ + if (unlikely(test_action_state(TOI_TEST_BIO) && writing)) { + /* Fake having done the hard work */ + bio->bi_error = 0; + toi_end_bio(bio); + } else + submit_bio(writing | REQ_SYNC, bio); + + return 0; +} + +/** + * toi_do_io: Prepare to do some i/o on a page and submit or batch it. + * + * @writing: Whether reading or writing. + * @bdev: The block device which we're using. + * @block0: The first sector we're reading or writing. + * @page: The page on which I/O is being done. + * @readahead_index: If doing readahead, the index (reset this flag when done). + * @syncio: Whether the i/o is being done synchronously. + * + * Prepare and start a read or write operation. + * + * Note that we always work with our own page. If writing, we might be given a + * compression buffer that will immediately be used to start compressing the + * next page. For reading, we do readahead and therefore don't know the final + * address where the data needs to go. + **/ +int toi_do_io(int writing, struct block_device *bdev, long block0, + struct page *page, int is_readahead, int syncio, int free_group) +{ + page->private = 0; + + /* Do here so we don't race against toi_bio_get_next_page_read */ + lock_page(page); + + if (is_readahead) { + if (readahead_list_head) + readahead_list_tail->private = (unsigned long) page; + else + readahead_list_head = page; + + readahead_list_tail = page; + } + + /* Done before submitting to avoid races. */ + if (syncio) + waiting_on = page; + + /* Submit the page */ + get_page(page); + + if (submit(writing, bdev, block0, page, free_group)) + return -EFAULT; + + if (syncio) + do_bio_wait(2); + + return 0; +} + +/** + * toi_bdev_page_io - simpler interface to do directly i/o on a single page + * @writing: Whether reading or writing. + * @bdev: Block device on which we're operating. + * @pos: Sector at which page to read or write starts. + * @page: Page to be read/written. + * + * A simple interface to submit a page of I/O and wait for its completion. + * The caller must free the page used. + **/ +static int toi_bdev_page_io(int writing, struct block_device *bdev, + long pos, struct page *page) +{ + return toi_do_io(writing, bdev, pos, page, 0, 1, 0); +} + +/** + * toi_bio_memory_needed - report the amount of memory needed for block i/o + * + * We want to have at least enough memory so as to have target_outstanding_io + * or more transactions on the fly at once. If we can do more, fine. + **/ +static int toi_bio_memory_needed(void) +{ + return target_outstanding_io * (PAGE_SIZE + sizeof(struct request) + + sizeof(struct bio)); +} + +/** + * toi_bio_print_debug_stats - put out debugging info in the buffer provided + * @buffer: A buffer of size @size into which text should be placed. + * @size: The size of @buffer. + * + * Fill a buffer with debugging info. This is used for both our debug_info sysfs + * entry and for recording the same info in dmesg. + **/ +static int toi_bio_print_debug_stats(char *buffer, int size) +{ + int len = 0; + + if (toiActiveAllocator != &toi_blockwriter_ops) { + len = scnprintf(buffer, size, + "- Block I/O inactive.\n"); + return len; + } + + len = scnprintf(buffer, size, "- Block I/O active.\n"); + + len += toi_bio_chains_debug_info(buffer + len, size - len); + + len += scnprintf(buffer + len, size - len, + "- Max outstanding reads %d. Max writes %d.\n", + max_outstanding_reads, max_outstanding_writes); + + len += scnprintf(buffer + len, size - len, + " Memory_needed: %d x (%lu + %u + %u) = %d bytes.\n", + target_outstanding_io, + PAGE_SIZE, (unsigned int) sizeof(struct request), + (unsigned int) sizeof(struct bio), toi_bio_memory_needed()); + +#ifdef MEASURE_MUTEX_CONTENTION + { + int i; + + len += scnprintf(buffer + len, size - len, + " Mutex contention while reading:\n Contended Free\n"); + + for_each_online_cpu(i) + len += scnprintf(buffer + len, size - len, + " %9lu %9lu\n", + mutex_times[0][0][i], mutex_times[0][1][i]); + + len += scnprintf(buffer + len, size - len, + " Mutex contention while writing:\n Contended Free\n"); + + for_each_online_cpu(i) + len += scnprintf(buffer + len, size - len, + " %9lu %9lu\n", + mutex_times[1][0][i], mutex_times[1][1][i]); + + } +#endif + + return len + scnprintf(buffer + len, size - len, + " Free mem throttle point reached %d.\n", free_mem_throttle); +} + +static int total_header_bytes; +static int unowned; + +void debug_broken_header(void) +{ + printk(KERN_DEBUG "Image header too big for size allocated!\n"); + print_toi_header_storage_for_modules(); + printk(KERN_DEBUG "Page flags : %d.\n", toi_pageflags_space_needed()); + printk(KERN_DEBUG "toi_header : %zu.\n", sizeof(struct toi_header)); + printk(KERN_DEBUG "Total unowned : %d.\n", unowned); + printk(KERN_DEBUG "Total used : %d (%ld pages).\n", total_header_bytes, + DIV_ROUND_UP(total_header_bytes, PAGE_SIZE)); + printk(KERN_DEBUG "Space needed now : %ld.\n", + get_header_storage_needed()); + dump_block_chains(); + abort_hibernate(TOI_HEADER_TOO_BIG, "Header reservation too small."); +} + +static int toi_bio_update_previous_inc_img_ptr(int stream) +{ + int result; + char * buffer = (char *) toi_get_zeroed_page(12, TOI_ATOMIC_GFP); + struct page *page; + struct toi_incremental_image_pointer *prev, *this; + + prev = &toi_inc_ptr[stream][0]; + this = &toi_inc_ptr[stream][1]; + + if (!buffer) { + // We're at the start of writing a pageset. Memory should not be that scarce. + return -ENOMEM; + } + + page = virt_to_page(buffer); + result = toi_do_io(READ, prev->bdev, prev->block, page, 0, 1, 0); + + if (result) + goto out; + + memcpy(buffer, (char *) this, sizeof(this->save)); + + result = toi_do_io(WRITE, prev->bdev, prev->block, page, 0, 0, 12); + + // If the IO is successfully submitted (!result), the page will be freed + // asynchronously on completion. +out: + if (result) + toi__free_page(12, virt_to_page(buffer)); + return result; +} + +/** + * toi_rw_init_incremental - incremental image part of setting up to write new section + */ +static int toi_write_init_incremental(int stream) +{ + int result = 0; + + // Remember the location of this block so we can link to it. + toi_bio_store_inc_image_ptr(&toi_inc_ptr[stream][1]); + + // Update the pointer at the start of the last pageset with the same stream number. + result = toi_bio_update_previous_inc_img_ptr(stream); + if (result) + return result; + + // Move the current to the previous slot. + memcpy(&toi_inc_ptr[stream][0], &toi_inc_ptr[stream][1], sizeof(toi_inc_ptr[stream][1])); + + // Store a blank pointer at the start of this incremental pageset + memset(&toi_inc_ptr[stream][1], 0, sizeof(toi_inc_ptr[stream][1])); + result = toi_rw_buffer(WRITE, (char *) &toi_inc_ptr[stream][1], sizeof(toi_inc_ptr[stream][1]), 0); + if (result) + return result; + + // Serialise extent chains if this is an incremental pageset + return toi_serialise_extent_chains(); +} + +/** + * toi_read_init_incremental - incremental image part of setting up to read new section + */ +static int toi_read_init_incremental(int stream) +{ + int result; + + // Set our position to the start of the next pageset + toi_bio_restore_inc_image_ptr(&toi_inc_ptr[stream][1]); + + // Read the start of the next incremental pageset (if any) + result = toi_rw_buffer(READ, (char *) &toi_inc_ptr[stream][1], sizeof(toi_inc_ptr[stream][1]), 0); + + if (!result) + result = toi_load_extent_chains(); + + return result; +} + +/** + * toi_rw_init - prepare to read or write a stream in the image + * @writing: Whether reading or writing. + * @stream number: Section of the image being processed. + * + * Prepare to read or write a section ('stream') in the image. + **/ +static int toi_rw_init(int writing, int stream_number) +{ + if (stream_number) + toi_extent_state_restore(stream_number); + else + toi_extent_state_goto_start(); + + if (writing) { + reset_idx = 0; + if (!current_stream) + page_idx = 0; + } else { + reset_idx = 1; + } + + atomic_set(&toi_io_done, 0); + if (!toi_writer_buffer) + toi_writer_buffer = (char *) toi_get_zeroed_page(11, + TOI_ATOMIC_GFP); + toi_writer_buffer_posn = writing ? 0 : PAGE_SIZE; + + current_stream = stream_number; + + more_readahead = 1; + + if (test_result_state(TOI_KEPT_IMAGE)) { + int result; + + if (writing) { + result = toi_write_init_incremental(stream_number); + } else { + result = toi_read_init_incremental(stream_number); + } + + if (result) + return result; + } + + return toi_writer_buffer ? 0 : -ENOMEM; +} + +/** + * toi_bio_queue_write - queue a page for writing + * @full_buffer: Pointer to a page to be queued + * + * Add a page to the queue to be submitted. If we're the queue flusher, + * we'll do this once we've dropped toi_bio_mutex, so other threads can + * continue to submit I/O while we're on the slow path doing the actual + * submission. + **/ +static void toi_bio_queue_write(char **full_buffer) +{ + struct page *page = virt_to_page(*full_buffer); + unsigned long flags; + + *full_buffer = NULL; + page->private = 0; + + spin_lock_irqsave(&bio_queue_lock, flags); + if (!bio_queue_head) + bio_queue_head = page; + else + bio_queue_tail->private = (unsigned long) page; + + bio_queue_tail = page; + atomic_inc(&toi_bio_queue_size); + + spin_unlock_irqrestore(&bio_queue_lock, flags); + wake_up(&toi_io_queue_flusher); +} + +/** + * toi_rw_cleanup - Cleanup after i/o. + * @writing: Whether we were reading or writing. + * + * Flush all I/O and clean everything up after reading or writing a + * section of the image. + **/ +static int toi_rw_cleanup(int writing) +{ + int i, result = 0; + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "toi_rw_cleanup."); + if (writing) { + if (toi_writer_buffer_posn && !test_result_state(TOI_ABORTED)) + toi_bio_queue_write(&toi_writer_buffer); + + while (bio_queue_head && !result) + result = toi_bio_queue_flush_pages(0); + + if (result) + return result; + + if (current_stream == 2) + toi_extent_state_save(1); + else if (current_stream == 1) + toi_extent_state_save(3); + } + + result = toi_finish_all_io(); + + while (readahead_list_head) { + void *next = (void *) readahead_list_head->private; + toi__free_page(12, readahead_list_head); + readahead_list_head = next; + } + + readahead_list_tail = NULL; + + if (!current_stream) + return result; + + for (i = 0; i < NUM_REASONS; i++) { + if (!atomic_read(&reasons[i])) + continue; + printk(KERN_DEBUG "Waited for i/o due to %s %d times.\n", + reason_name[i], atomic_read(&reasons[i])); + atomic_set(&reasons[i], 0); + } + + current_stream = 0; + return result; +} + +/** + * toi_start_one_readahead - start one page of readahead + * @dedicated_thread: Is this a thread dedicated to doing readahead? + * + * Start one new page of readahead. If this is being called by a thread + * whose only just is to submit readahead, don't quit because we failed + * to allocate a page. + **/ +static int toi_start_one_readahead(int dedicated_thread) +{ + char *buffer = NULL; + int oom = 0, result; + + result = throttle_if_needed(dedicated_thread ? THROTTLE_WAIT : 0); + if (result) { + printk("toi_start_one_readahead: throttle_if_needed returned %d.\n", result); + return result; + } + + mutex_lock(&toi_bio_readahead_mutex); + + while (!buffer) { + buffer = (char *) toi_get_zeroed_page(12, + TOI_ATOMIC_GFP); + if (!buffer) { + if (oom && !dedicated_thread) { + mutex_unlock(&toi_bio_readahead_mutex); + printk("toi_start_one_readahead: oom and !dedicated thread %d.\n", result); + return -ENOMEM; + } + + oom = 1; + set_free_mem_throttle(); + do_bio_wait(5); + } + } + + result = toi_bio_rw_page(READ, virt_to_page(buffer), 1, 0); + if (result) { + printk("toi_start_one_readahead: toi_bio_rw_page returned %d.\n", result); + } + if (result == -ENOSPC) + toi__free_page(12, virt_to_page(buffer)); + mutex_unlock(&toi_bio_readahead_mutex); + if (result) { + if (result == -ENOSPC) + toi_message(TOI_BIO, TOI_VERBOSE, 0, + "Last readahead page submitted."); + else + printk(KERN_DEBUG "toi_bio_rw_page returned %d.\n", + result); + } + return result; +} + +/** + * toi_start_new_readahead - start new readahead + * @dedicated_thread: Are we dedicated to this task? + * + * Start readahead of image pages. + * + * We can be called as a thread dedicated to this task (may be helpful on + * systems with lots of CPUs), in which case we don't exit until there's no + * more readahead. + * + * If this is not called by a dedicated thread, we top up our queue until + * there's no more readahead to submit, we've submitted the number given + * in target_outstanding_io or the number in progress exceeds the target + * outstanding I/O value. + * + * No mutex needed because this is only ever called by the first cpu. + **/ +static int toi_start_new_readahead(int dedicated_thread) +{ + int last_result, num_submitted = 0; + + /* Start a new readahead? */ + if (!more_readahead) + return 0; + + do { + last_result = toi_start_one_readahead(dedicated_thread); + + if (last_result) { + if (last_result == -ENOMEM || last_result == -ENOSPC) + return 0; + + printk(KERN_DEBUG + "Begin read chunk returned %d.\n", + last_result); + } else + num_submitted++; + + } while (more_readahead && !last_result && + (dedicated_thread || + (num_submitted < target_outstanding_io && + atomic_read(&toi_io_in_progress) < target_outstanding_io))); + + return last_result; +} + +/** + * bio_io_flusher - start the dedicated I/O flushing routine + * @writing: Whether we're writing the image. + **/ +static int bio_io_flusher(int writing) +{ + + if (writing) + return toi_bio_queue_flush_pages(1); + else + return toi_start_new_readahead(1); +} + +/** + * toi_bio_get_next_page_read - read a disk page, perhaps with readahead + * @no_readahead: Whether we can use readahead + * + * Read a page from disk, submitting readahead and cleaning up finished i/o + * while we wait for the page we're after. + **/ +static int toi_bio_get_next_page_read(int no_readahead) +{ + char *virt; + struct page *old_readahead_list_head; + + /* + * When reading the second page of the header, we have to + * delay submitting the read until after we've gotten the + * extents out of the first page. + */ + if (unlikely(no_readahead)) { + int result = toi_start_one_readahead(0); + if (result) { + printk(KERN_EMERG "No readahead and toi_start_one_readahead " + "returned non-zero.\n"); + return -EIO; + } + } + + if (unlikely(!readahead_list_head)) { + /* + * If the last page finishes exactly on the page + * boundary, we will be called one extra time and + * have no data to return. In this case, we should + * not BUG(), like we used to! + */ + if (!more_readahead) { + printk(KERN_EMERG "No more readahead.\n"); + return -ENOSPC; + } + if (unlikely(toi_start_one_readahead(0))) { + printk(KERN_EMERG "No readahead and " + "toi_start_one_readahead returned non-zero.\n"); + return -EIO; + } + } + + if (PageLocked(readahead_list_head)) { + waiting_on = readahead_list_head; + do_bio_wait(0); + } + + virt = page_address(readahead_list_head); + memcpy(toi_writer_buffer, virt, PAGE_SIZE); + + mutex_lock(&toi_bio_readahead_mutex); + old_readahead_list_head = readahead_list_head; + readahead_list_head = (struct page *) readahead_list_head->private; + mutex_unlock(&toi_bio_readahead_mutex); + toi__free_page(12, old_readahead_list_head); + return 0; +} + +/** + * toi_bio_queue_flush_pages - flush the queue of pages queued for writing + * @dedicated_thread: Whether we're a dedicated thread + * + * Flush the queue of pages ready to be written to disk. + * + * If we're a dedicated thread, stay in here until told to leave, + * sleeping in wait_event. + * + * The first thread is normally the only one to come in here. Another + * thread can enter this routine too, though, via throttle_if_needed. + * Since that's the case, we must be careful to only have one thread + * doing this work at a time. Otherwise we have a race and could save + * pages out of order. + * + * If an error occurs, free all remaining pages without submitting them + * for I/O. + **/ + +int toi_bio_queue_flush_pages(int dedicated_thread) +{ + unsigned long flags; + int result = 0; + static DEFINE_MUTEX(busy); + + if (!mutex_trylock(&busy)) + return 0; + +top: + spin_lock_irqsave(&bio_queue_lock, flags); + while (bio_queue_head) { + struct page *page = bio_queue_head; + bio_queue_head = (struct page *) page->private; + if (bio_queue_tail == page) + bio_queue_tail = NULL; + atomic_dec(&toi_bio_queue_size); + spin_unlock_irqrestore(&bio_queue_lock, flags); + + /* Don't generate more error messages if already had one */ + if (!result) + result = toi_bio_rw_page(WRITE, page, 0, 11); + /* + * If writing the page failed, don't drop out. + * Flush the rest of the queue too. + */ + if (result) + toi__free_page(11 , page); + spin_lock_irqsave(&bio_queue_lock, flags); + } + spin_unlock_irqrestore(&bio_queue_lock, flags); + + if (dedicated_thread) { + wait_event(toi_io_queue_flusher, bio_queue_head || + toi_bio_queue_flusher_should_finish); + if (likely(!toi_bio_queue_flusher_should_finish)) + goto top; + toi_bio_queue_flusher_should_finish = 0; + } + + mutex_unlock(&busy); + return result; +} + +/** + * toi_bio_get_new_page - get a new page for I/O + * @full_buffer: Pointer to a page to allocate. + **/ +static int toi_bio_get_new_page(char **full_buffer) +{ + int result = throttle_if_needed(THROTTLE_WAIT); + if (result) + return result; + + while (!*full_buffer) { + *full_buffer = (char *) toi_get_zeroed_page(11, TOI_ATOMIC_GFP); + if (!*full_buffer) { + set_free_mem_throttle(); + do_bio_wait(3); + } + } + + return 0; +} + +/** + * toi_rw_buffer - combine smaller buffers into PAGE_SIZE I/O + * @writing: Bool - whether writing (or reading). + * @buffer: The start of the buffer to write or fill. + * @buffer_size: The size of the buffer to write or fill. + * @no_readahead: Don't try to start readhead (when getting extents). + **/ +static int toi_rw_buffer(int writing, char *buffer, int buffer_size, + int no_readahead) +{ + int bytes_left = buffer_size, result = 0; + + while (bytes_left) { + char *source_start = buffer + buffer_size - bytes_left; + char *dest_start = toi_writer_buffer + toi_writer_buffer_posn; + int capacity = PAGE_SIZE - toi_writer_buffer_posn; + char *to = writing ? dest_start : source_start; + char *from = writing ? source_start : dest_start; + + if (bytes_left <= capacity) { + memcpy(to, from, bytes_left); + toi_writer_buffer_posn += bytes_left; + return 0; + } + + /* Complete this page and start a new one */ + memcpy(to, from, capacity); + bytes_left -= capacity; + + if (!writing) { + /* + * Perform actual I/O: + * read readahead_list_head into toi_writer_buffer + */ + int result = toi_bio_get_next_page_read(no_readahead); + if (result && bytes_left) { + printk("toi_bio_get_next_page_read " + "returned %d. Expecting to read %d bytes.\n", result, bytes_left); + return result; + } + } else { + toi_bio_queue_write(&toi_writer_buffer); + result = toi_bio_get_new_page(&toi_writer_buffer); + if (result) { + printk(KERN_ERR "toi_bio_get_new_page returned " + "%d.\n", result); + return result; + } + } + + toi_writer_buffer_posn = 0; + toi_cond_pause(0, NULL); + } + + return 0; +} + +/** + * toi_bio_read_page - read a page of the image + * @pfn: The pfn where the data belongs. + * @buffer_page: The page containing the (possibly compressed) data. + * @buf_size: The number of bytes on @buffer_page used (PAGE_SIZE). + * + * Read a (possibly compressed) page from the image, into buffer_page, + * returning its pfn and the buffer size. + **/ +static int toi_bio_read_page(unsigned long *pfn, int buf_type, + void *buffer_page, unsigned int *buf_size) +{ + int result = 0; + int this_idx; + char *buffer_virt = TOI_MAP(buf_type, buffer_page); + + /* + * Only call start_new_readahead if we don't have a dedicated thread + * and we're the queue flusher. + */ + if (current == toi_queue_flusher && more_readahead && + !test_action_state(TOI_NO_READAHEAD)) { + int result2 = toi_start_new_readahead(0); + if (result2) { + printk(KERN_DEBUG "Queue flusher and " + "toi_start_one_readahead returned non-zero.\n"); + result = -EIO; + goto out; + } + } + + my_mutex_lock(0, &toi_bio_mutex); + + /* + * Structure in the image: + * [destination pfn|page size|page data] + * buf_size is PAGE_SIZE + * We can validly find there's nothing to read in a multithreaded + * situation. + */ + if (toi_rw_buffer(READ, (char *) &this_idx, sizeof(int), 0) || + toi_rw_buffer(READ, (char *) pfn, sizeof(unsigned long), 0) || + toi_rw_buffer(READ, (char *) buf_size, sizeof(int), 0) || + toi_rw_buffer(READ, buffer_virt, *buf_size, 0)) { + result = -ENODATA; + goto out_unlock; + } + + if (reset_idx) { + page_idx = this_idx; + reset_idx = 0; + } else { + page_idx++; + if (!this_idx) + result = -ENODATA; + else if (page_idx != this_idx) + printk(KERN_ERR "Got page index %d, expected %d.\n", + this_idx, page_idx); + } + +out_unlock: + my_mutex_unlock(0, &toi_bio_mutex); +out: + TOI_UNMAP(buf_type, buffer_page); + return result; +} + +/** + * toi_bio_write_page - write a page of the image + * @pfn: The pfn where the data belongs. + * @buffer_page: The page containing the (possibly compressed) data. + * @buf_size: The number of bytes on @buffer_page used. + * + * Write a (possibly compressed) page to the image from the buffer, together + * with it's index and buffer size. + **/ +static int toi_bio_write_page(unsigned long pfn, int buf_type, + void *buffer_page, unsigned int buf_size) +{ + char *buffer_virt; + int result = 0, result2 = 0; + + if (unlikely(test_action_state(TOI_TEST_FILTER_SPEED))) + return 0; + + my_mutex_lock(1, &toi_bio_mutex); + + if (test_result_state(TOI_ABORTED)) { + my_mutex_unlock(1, &toi_bio_mutex); + return 0; + } + + buffer_virt = TOI_MAP(buf_type, buffer_page); + page_idx++; + + /* + * Structure in the image: + * [destination pfn|page size|page data] + * buf_size is PAGE_SIZE + */ + if (toi_rw_buffer(WRITE, (char *) &page_idx, sizeof(int), 0) || + toi_rw_buffer(WRITE, (char *) &pfn, sizeof(unsigned long), 0) || + toi_rw_buffer(WRITE, (char *) &buf_size, sizeof(int), 0) || + toi_rw_buffer(WRITE, buffer_virt, buf_size, 0)) { + printk(KERN_DEBUG "toi_rw_buffer returned non-zero to " + "toi_bio_write_page.\n"); + result = -EIO; + } + + TOI_UNMAP(buf_type, buffer_page); + my_mutex_unlock(1, &toi_bio_mutex); + + if (current == toi_queue_flusher) + result2 = toi_bio_queue_flush_pages(0); + + return result ? result : result2; +} + +/** + * _toi_rw_header_chunk - read or write a portion of the image header + * @writing: Whether reading or writing. + * @owner: The module for which we're writing. + * Used for confirming that modules + * don't use more header space than they asked for. + * @buffer: Address of the data to write. + * @buffer_size: Size of the data buffer. + * @no_readahead: Don't try to start readhead (when getting extents). + * + * Perform PAGE_SIZE I/O. Start readahead if needed. + **/ +static int _toi_rw_header_chunk(int writing, struct toi_module_ops *owner, + char *buffer, int buffer_size, int no_readahead) +{ + int result = 0; + + if (owner) { + owner->header_used += buffer_size; + toi_message(TOI_HEADER, TOI_LOW, 1, + "Header: %s : %d bytes (%d/%d) from offset %d.", + owner->name, + buffer_size, owner->header_used, + owner->header_requested, + toi_writer_buffer_posn); + if (owner->header_used > owner->header_requested && writing) { + printk(KERN_EMERG "TuxOnIce module %s is using more " + "header space (%u) than it requested (%u).\n", + owner->name, + owner->header_used, + owner->header_requested); + return buffer_size; + } + } else { + unowned += buffer_size; + toi_message(TOI_HEADER, TOI_LOW, 1, + "Header: (No owner): %d bytes (%d total so far) from " + "offset %d.", buffer_size, unowned, + toi_writer_buffer_posn); + } + + if (!writing && !no_readahead && more_readahead) { + result = toi_start_new_readahead(0); + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Start new readahead " + "returned %d.", result); + } + + if (!result) { + result = toi_rw_buffer(writing, buffer, buffer_size, + no_readahead); + toi_message(TOI_BIO, TOI_VERBOSE, 0, "rw_buffer returned " + "%d.", result); + } + + total_header_bytes += buffer_size; + toi_message(TOI_BIO, TOI_VERBOSE, 0, "_toi_rw_header_chunk returning " + "%d.", result); + return result; +} + +static int toi_rw_header_chunk(int writing, struct toi_module_ops *owner, + char *buffer, int size) +{ + return _toi_rw_header_chunk(writing, owner, buffer, size, 1); +} + +static int toi_rw_header_chunk_noreadahead(int writing, + struct toi_module_ops *owner, char *buffer, int size) +{ + return _toi_rw_header_chunk(writing, owner, buffer, size, 1); +} + +/** + * toi_bio_storage_needed - get the amount of storage needed for my fns + **/ +static int toi_bio_storage_needed(void) +{ + return sizeof(int) + PAGE_SIZE + toi_bio_devinfo_storage_needed(); +} + +/** + * toi_bio_save_config_info - save block I/O config to image header + * @buf: PAGE_SIZE'd buffer into which data should be saved. + **/ +static int toi_bio_save_config_info(char *buf) +{ + int *ints = (int *) buf; + ints[0] = target_outstanding_io; + return sizeof(int); +} + +/** + * toi_bio_load_config_info - restore block I/O config + * @buf: Data to be reloaded. + * @size: Size of the buffer saved. + **/ +static void toi_bio_load_config_info(char *buf, int size) +{ + int *ints = (int *) buf; + target_outstanding_io = ints[0]; +} + +void close_resume_dev_t(int force) +{ + if (!resume_block_device) + return; + + if (force) + atomic_set(&resume_bdev_open_count, 0); + else + atomic_dec(&resume_bdev_open_count); + + if (!atomic_read(&resume_bdev_open_count)) { + toi_close_bdev(resume_block_device); + resume_block_device = NULL; + } +} + +int open_resume_dev_t(int force, int quiet) +{ + if (force) { + close_resume_dev_t(1); + atomic_set(&resume_bdev_open_count, 1); + } else + atomic_inc(&resume_bdev_open_count); + + if (resume_block_device) + return 0; + + resume_block_device = toi_open_bdev(NULL, resume_dev_t, 0); + if (IS_ERR(resume_block_device)) { + if (!quiet) + toi_early_boot_message(1, TOI_CONTINUE_REQ, + "Failed to open device %x, where" + " the header should be found.", + resume_dev_t); + resume_block_device = NULL; + atomic_set(&resume_bdev_open_count, 0); + return 1; + } + + return 0; +} + +/** + * toi_bio_initialise - initialise bio code at start of some action + * @starting_cycle: Whether starting a hibernation cycle, or just reading or + * writing a sysfs value. + **/ +static int toi_bio_initialise(int starting_cycle) +{ + int result; + + if (!starting_cycle || !resume_dev_t) + return 0; + + max_outstanding_writes = 0; + max_outstanding_reads = 0; + current_stream = 0; + toi_queue_flusher = current; +#ifdef MEASURE_MUTEX_CONTENTION + { + int i, j, k; + + for (i = 0; i < 2; i++) + for (j = 0; j < 2; j++) + for_each_online_cpu(k) + mutex_times[i][j][k] = 0; + } +#endif + result = open_resume_dev_t(0, 1); + + if (result) + return result; + + return get_signature_page(); +} + +static unsigned long raw_to_real(unsigned long raw) +{ + unsigned long extra; + + extra = (raw * (sizeof(unsigned long) + sizeof(int)) + + (PAGE_SIZE + sizeof(unsigned long) + sizeof(int) + 1)) / + (PAGE_SIZE + sizeof(unsigned long) + sizeof(int)); + + return raw > extra ? raw - extra : 0; +} + +static unsigned long toi_bio_storage_available(void) +{ + unsigned long sum = 0; + struct toi_module_ops *this_module; + + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled || + this_module->type != BIO_ALLOCATOR_MODULE) + continue; + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Seeking storage " + "available from %s.", this_module->name); + sum += this_module->bio_allocator_ops->storage_available(); + } + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Total storage available is %lu " + "pages (%d header pages).", sum, header_pages_reserved); + + return sum > header_pages_reserved ? + raw_to_real(sum - header_pages_reserved) : 0; + +} + +static unsigned long toi_bio_storage_allocated(void) +{ + return raw_pages_allocd > header_pages_reserved ? + raw_to_real(raw_pages_allocd - header_pages_reserved) : 0; +} + +/* + * If we have read part of the image, we might have filled memory with + * data that should be zeroed out. + */ +static void toi_bio_noresume_reset(void) +{ + toi_message(TOI_BIO, TOI_VERBOSE, 0, "toi_bio_noresume_reset."); + toi_rw_cleanup(READ); + free_all_bdev_info(); +} + +/** + * toi_bio_cleanup - cleanup after some action + * @finishing_cycle: Whether completing a cycle. + **/ +static void toi_bio_cleanup(int finishing_cycle) +{ + if (!finishing_cycle) + return; + + if (toi_writer_buffer) { + toi_free_page(11, (unsigned long) toi_writer_buffer); + toi_writer_buffer = NULL; + } + + forget_signature_page(); + + if (header_block_device && toi_sig_data && + toi_sig_data->header_dev_t != resume_dev_t) + toi_close_bdev(header_block_device); + + header_block_device = NULL; + + close_resume_dev_t(0); +} + +static int toi_bio_write_header_init(void) +{ + int result; + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "toi_bio_write_header_init"); + toi_rw_init(WRITE, 0); + toi_writer_buffer_posn = 0; + + /* Info needed to bootstrap goes at the start of the header. + * First we save the positions and devinfo, including the number + * of header pages. Then we save the structs containing data needed + * for reading the header pages back. + * Note that even if header pages take more than one page, when we + * read back the info, we will have restored the location of the + * next header page by the time we go to use it. + */ + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "serialise extent chains."); + result = toi_serialise_extent_chains(); + + if (result) + return result; + + /* + * Signature page hasn't been modified at this point. Write it in + * the header so we can restore it later. + */ + toi_message(TOI_BIO, TOI_VERBOSE, 0, "serialise signature page."); + return toi_rw_header_chunk_noreadahead(WRITE, &toi_blockwriter_ops, + (char *) toi_cur_sig_page, + PAGE_SIZE); +} + +static int toi_bio_write_header_cleanup(void) +{ + int result = 0; + + if (toi_writer_buffer_posn) + toi_bio_queue_write(&toi_writer_buffer); + + result = toi_finish_all_io(); + + unowned = 0; + total_header_bytes = 0; + + /* Set signature to save we have an image */ + if (!result) + result = toi_bio_mark_have_image(); + + return result; +} + +/* + * toi_bio_read_header_init() + * + * Description: + * 1. Attempt to read the device specified with resume=. + * 2. Check the contents of the swap header for our signature. + * 3. Warn, ignore, reset and/or continue as appropriate. + * 4. If continuing, read the toi_swap configuration section + * of the header and set up block device info so we can read + * the rest of the header & image. + * + * Returns: + * May not return if user choose to reboot at a warning. + * -EINVAL if cannot resume at this time. Booting should continue + * normally. + */ + +static int toi_bio_read_header_init(void) +{ + int result = 0; + char buf[32]; + + toi_writer_buffer_posn = 0; + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "toi_bio_read_header_init"); + + if (!toi_sig_data) { + printk(KERN_INFO "toi_bio_read_header_init called when we " + "haven't verified there is an image!\n"); + return -EINVAL; + } + + /* + * If the header is not on the resume_swap_dev_t, get the resume device + * first. + */ + toi_message(TOI_BIO, TOI_VERBOSE, 0, "Header dev_t is %lx.", + toi_sig_data->header_dev_t); + if (toi_sig_data->have_uuid) { + struct fs_info seek; + dev_t device; + + strncpy((char *) seek.uuid, toi_sig_data->header_uuid, 16); + seek.dev_t = toi_sig_data->header_dev_t; + seek.last_mount_size = 0; + device = blk_lookup_fs_info(&seek); + if (device) { + printk("Using dev_t %s, returned by blk_lookup_fs_info.\n", + format_dev_t(buf, device)); + toi_sig_data->header_dev_t = device; + } + } + if (toi_sig_data->header_dev_t != resume_dev_t) { + header_block_device = toi_open_bdev(NULL, + toi_sig_data->header_dev_t, 1); + + if (IS_ERR(header_block_device)) + return PTR_ERR(header_block_device); + } else + header_block_device = resume_block_device; + + if (!toi_writer_buffer) + toi_writer_buffer = (char *) toi_get_zeroed_page(11, + TOI_ATOMIC_GFP); + more_readahead = 1; + + /* + * Read toi_swap configuration. + * Headerblock size taken into account already. + */ + result = toi_bio_ops.bdev_page_io(READ, header_block_device, + toi_sig_data->first_header_block, + virt_to_page((unsigned long) toi_writer_buffer)); + if (result) + return result; + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "load extent chains."); + result = toi_load_extent_chains(); + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "load original signature page."); + toi_orig_sig_page = (char *) toi_get_zeroed_page(38, TOI_ATOMIC_GFP); + if (!toi_orig_sig_page) { + printk(KERN_ERR "Failed to allocate memory for the current" + " image signature.\n"); + return -ENOMEM; + } + + return toi_rw_header_chunk_noreadahead(READ, &toi_blockwriter_ops, + (char *) toi_orig_sig_page, + PAGE_SIZE); +} + +static int toi_bio_read_header_cleanup(void) +{ + toi_message(TOI_BIO, TOI_VERBOSE, 0, "toi_bio_read_header_cleanup."); + return toi_rw_cleanup(READ); +} + +/* Works only for digits and letters, but small and fast */ +#define TOLOWER(x) ((x) | 0x20) + +/* + * UUID must be 32 chars long. It may have dashes, but nothing + * else. + */ +char *uuid_from_commandline(char *commandline) +{ + int low = 0; + char *result = NULL, *output, *ptr; + + if (strncmp(commandline, "UUID=", 5)) + return NULL; + + result = kzalloc(17, GFP_KERNEL); + if (!result) { + printk("Failed to kzalloc UUID text memory.\n"); + return NULL; + } + + ptr = commandline + 5; + output = result; + + while (*ptr && (output - result) < 16) { + if (isxdigit(*ptr)) { + int value = isdigit(*ptr) ? *ptr - '0' : + TOLOWER(*ptr) - 'a' + 10; + if (low) { + *output += value; + output++; + } else { + *output = value << 4; + } + low = !low; + } else if (*ptr != '-') + break; + ptr++; + } + + if ((output - result) < 16 || *ptr) { + printk(KERN_DEBUG "Found resume=UUID=, but the value looks " + "invalid.\n"); + kfree(result); + result = NULL; + } + + return result; +} + +#define retry_if_fails(command) \ +do { \ + command; \ + if (!resume_dev_t && !waited_for_device_probe) { \ + wait_for_device_probe(); \ + command; \ + waited_for_device_probe = 1; \ + } \ +} while(0) + +/** + * try_to_open_resume_device: Try to parse and open resume= + * + * Any "swap:" has been stripped away and we just have the path to deal with. + * We attempt to do name_to_dev_t, open and stat the file. Having opened the + * file, get the struct block_device * to match. + */ +static int try_to_open_resume_device(char *commandline, int quiet) +{ + struct kstat stat; + int error = 0; + char *uuid = uuid_from_commandline(commandline); + int waited_for_device_probe = 0; + + resume_dev_t = MKDEV(0, 0); + + if (!strlen(commandline)) + retry_if_fails(toi_bio_scan_for_image(quiet)); + + if (uuid) { + struct fs_info seek; + strncpy((char *) &seek.uuid, uuid, 16); + seek.dev_t = resume_dev_t; + seek.last_mount_size = 0; + retry_if_fails(resume_dev_t = blk_lookup_fs_info(&seek)); + kfree(uuid); + } + + if (!resume_dev_t) + retry_if_fails(resume_dev_t = name_to_dev_t(commandline)); + + if (!resume_dev_t) { + struct file *file = filp_open(commandline, + O_RDONLY|O_LARGEFILE, 0); + + if (!IS_ERR(file) && file) { + vfs_getattr(&file->f_path, &stat); + filp_close(file, NULL); + } else + error = vfs_stat(commandline, &stat); + if (!error) + resume_dev_t = stat.rdev; + } + + if (!resume_dev_t) { + if (quiet) + return 1; + + if (test_toi_state(TOI_TRYING_TO_RESUME)) + toi_early_boot_message(1, toi_translate_err_default, + "Failed to translate \"%s\" into a device id.\n", + commandline); + else + printk("TuxOnIce: Can't translate \"%s\" into a device " + "id yet.\n", commandline); + return 1; + } + + return open_resume_dev_t(1, quiet); +} + +/* + * Parse Image Location + * + * Attempt to parse a resume= parameter. + * Swap Writer accepts: + * resume=[swap:|file:]DEVNAME[:FIRSTBLOCK][@BLOCKSIZE] + * + * Where: + * DEVNAME is convertable to a dev_t by name_to_dev_t + * FIRSTBLOCK is the location of the first block in the swap file + * (specifying for a swap partition is nonsensical but not prohibited). + * Data is validated by attempting to read a swap header from the + * location given. Failure will result in toi_swap refusing to + * save an image, and a reboot with correct parameters will be + * necessary. + */ +static int toi_bio_parse_sig_location(char *commandline, + int only_allocator, int quiet) +{ + char *thischar, *devstart, *colon = NULL; + int signature_found, result = -EINVAL, temp_result = 0; + + if (strncmp(commandline, "swap:", 5) && + strncmp(commandline, "file:", 5)) { + /* + * Failing swap:, we'll take a simple resume=/dev/hda2, or a + * blank value (scan) but fall through to other allocators + * if /dev/ or UUID= isn't matched. + */ + if (strncmp(commandline, "/dev/", 5) && + strncmp(commandline, "UUID=", 5) && + strlen(commandline)) + return 1; + } else + commandline += 5; + + devstart = commandline; + thischar = commandline; + while ((*thischar != ':') && (*thischar != '@') && + ((thischar - commandline) < 250) && (*thischar)) + thischar++; + + if (*thischar == ':') { + colon = thischar; + *colon = 0; + thischar++; + } + + while ((thischar - commandline) < 250 && *thischar) + thischar++; + + if (colon) { + unsigned long block; + temp_result = kstrtoul(colon + 1, 0, &block); + if (!temp_result) + resume_firstblock = (int) block; + } else + resume_firstblock = 0; + + clear_toi_state(TOI_CAN_HIBERNATE); + clear_toi_state(TOI_CAN_RESUME); + + if (!temp_result) + temp_result = try_to_open_resume_device(devstart, quiet); + + if (colon) + *colon = ':'; + + /* No error if we only scanned */ + if (temp_result) + return strlen(commandline) ? -EINVAL : 1; + + signature_found = toi_bio_image_exists(quiet); + + if (signature_found != -1) { + result = 0; + /* + * TODO: If only file storage, CAN_HIBERNATE should only be + * set if file allocator's target is valid. + */ + set_toi_state(TOI_CAN_HIBERNATE); + set_toi_state(TOI_CAN_RESUME); + } else + if (!quiet) + printk(KERN_ERR "TuxOnIce: Block I/O: No " + "signature found at %s.\n", devstart); + + return result; +} + +static void toi_bio_release_storage(void) +{ + header_pages_reserved = 0; + raw_pages_allocd = 0; + + free_all_bdev_info(); +} + +/* toi_swap_remove_image + * + */ +static int toi_bio_remove_image(void) +{ + int result; + + toi_message(TOI_BIO, TOI_VERBOSE, 0, "toi_bio_remove_image."); + + result = toi_bio_restore_original_signature(); + + /* + * We don't do a sanity check here: we want to restore the swap + * whatever version of kernel made the hibernate image. + * + * We need to write swap, but swap may not be enabled so + * we write the device directly + * + * If we don't have an current_signature_page, we didn't + * read an image header, so don't change anything. + */ + + toi_bio_release_storage(); + + return result; +} + +struct toi_bio_ops toi_bio_ops = { + .bdev_page_io = toi_bdev_page_io, + .register_storage = toi_register_storage_chain, + .free_storage = toi_bio_release_storage, +}; + +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_INT("target_outstanding_io", SYSFS_RW, &target_outstanding_io, + 0, 16384, 0, NULL), +}; + +struct toi_module_ops toi_blockwriter_ops = { + .type = WRITER_MODULE, + .name = "block i/o", + .directory = "block_io", + .module = THIS_MODULE, + .memory_needed = toi_bio_memory_needed, + .print_debug_info = toi_bio_print_debug_stats, + .storage_needed = toi_bio_storage_needed, + .save_config_info = toi_bio_save_config_info, + .load_config_info = toi_bio_load_config_info, + .initialise = toi_bio_initialise, + .cleanup = toi_bio_cleanup, + .post_atomic_restore = toi_bio_chains_post_atomic, + + .rw_init = toi_rw_init, + .rw_cleanup = toi_rw_cleanup, + .read_page = toi_bio_read_page, + .write_page = toi_bio_write_page, + .rw_header_chunk = toi_rw_header_chunk, + .rw_header_chunk_noreadahead = toi_rw_header_chunk_noreadahead, + .io_flusher = bio_io_flusher, + .update_throughput_throttle = update_throughput_throttle, + .finish_all_io = toi_finish_all_io, + + .noresume_reset = toi_bio_noresume_reset, + .storage_available = toi_bio_storage_available, + .storage_allocated = toi_bio_storage_allocated, + .reserve_header_space = toi_bio_reserve_header_space, + .allocate_storage = toi_bio_allocate_storage, + .free_unused_storage = toi_bio_free_unused_storage, + .image_exists = toi_bio_image_exists, + .mark_resume_attempted = toi_bio_mark_resume_attempted, + .write_header_init = toi_bio_write_header_init, + .write_header_cleanup = toi_bio_write_header_cleanup, + .read_header_init = toi_bio_read_header_init, + .read_header_cleanup = toi_bio_read_header_cleanup, + .get_header_version = toi_bio_get_header_version, + .remove_image = toi_bio_remove_image, + .parse_sig_location = toi_bio_parse_sig_location, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +/** + * toi_block_io_load - load time routine for block I/O module + * + * Register block i/o ops and sysfs entries. + **/ +static __init int toi_block_io_load(void) +{ + return toi_register_module(&toi_blockwriter_ops); +} + +late_initcall(toi_block_io_load); diff --git b/kernel/power/tuxonice_bio_internal.h b/kernel/power/tuxonice_bio_internal.h new file mode 100644 index 0000000..5e1964a --- /dev/null +++ b/kernel/power/tuxonice_bio_internal.h @@ -0,0 +1,101 @@ +/* + * kernel/power/tuxonice_bio_internal.h + * + * Copyright (C) 2009-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * Distributed under GPLv2. + * + * This file contains declarations for functions exported from + * tuxonice_bio.c, which contains low level io functions. + */ + +/* Extent chains */ +void toi_extent_state_goto_start(void); +void toi_extent_state_save(int slot); +int go_next_page(int writing, int section_barrier); +void toi_extent_state_restore(int slot); +void free_all_bdev_info(void); +int devices_of_same_priority(struct toi_bdev_info *this); +int toi_register_storage_chain(struct toi_bdev_info *new); +int toi_serialise_extent_chains(void); +int toi_load_extent_chains(void); +int toi_bio_rw_page(int writing, struct page *page, int is_readahead, + int free_group); +int toi_bio_restore_original_signature(void); +int toi_bio_devinfo_storage_needed(void); +unsigned long get_headerblock(void); +dev_t get_header_dev_t(void); +struct block_device *get_header_bdev(void); +int toi_bio_allocate_storage(unsigned long request); +void toi_bio_free_unused_storage(void); + +/* Signature functions */ +#define HaveImage "HaveImage" +#define NoImage "TuxOnIce" +#define sig_size (sizeof(HaveImage)) + +struct sig_data { + char sig[sig_size]; + int have_image; + int resumed_before; + + char have_uuid; + char header_uuid[17]; + dev_t header_dev_t; + unsigned long first_header_block; + + /* Repeat the signature to be sure we have a header version */ + char sig2[sig_size]; + int header_version; +}; + +void forget_signature_page(void); +int toi_check_for_signature(void); +int toi_bio_image_exists(int quiet); +int get_signature_page(void); +int toi_bio_mark_resume_attempted(int); +extern char *toi_cur_sig_page; +extern char *toi_orig_sig_page; +int toi_bio_mark_have_image(void); +extern struct sig_data *toi_sig_data; +extern dev_t resume_dev_t; +extern struct block_device *resume_block_device; +extern struct block_device *header_block_device; +extern unsigned long resume_firstblock; + +struct block_device *open_bdev(dev_t device, int display_errs); +extern int current_stream; +extern int more_readahead; +int toi_do_io(int writing, struct block_device *bdev, long block0, + struct page *page, int is_readahead, int syncio, int free_group); +int get_main_pool_phys_params(void); + +void toi_close_bdev(struct block_device *bdev); +struct block_device *toi_open_bdev(char *uuid, dev_t default_device, + int display_errs); + +extern struct toi_module_ops toi_blockwriter_ops; +void dump_block_chains(void); +void debug_broken_header(void); +extern unsigned long raw_pages_allocd, header_pages_reserved; +int toi_bio_chains_debug_info(char *buffer, int size); +void toi_bio_chains_post_atomic(struct toi_boot_kernel_data *bkd); +int toi_bio_scan_for_image(int quiet); +int toi_bio_get_header_version(void); + +void close_resume_dev_t(int force); +int open_resume_dev_t(int force, int quiet); + +struct toi_incremental_image_pointer_saved_data { + unsigned long block; + int chain; +}; + +struct toi_incremental_image_pointer { + struct toi_incremental_image_pointer_saved_data save; + struct block_device *bdev; + unsigned long block; +}; + +void toi_bio_store_inc_image_ptr(struct toi_incremental_image_pointer *ptr); +void toi_bio_restore_inc_image_ptr(struct toi_incremental_image_pointer *ptr); diff --git b/kernel/power/tuxonice_bio_signature.c b/kernel/power/tuxonice_bio_signature.c new file mode 100644 index 0000000..f5418f0 --- /dev/null +++ b/kernel/power/tuxonice_bio_signature.c @@ -0,0 +1,403 @@ +/* + * kernel/power/tuxonice_bio_signature.c + * + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * Distributed under GPLv2. + * + */ + +#include + +#include "tuxonice.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_modules.h" +#include "tuxonice_prepare_image.h" +#include "tuxonice_bio.h" +#include "tuxonice_ui.h" +#include "tuxonice_alloc.h" +#include "tuxonice_io.h" +#include "tuxonice_builtin.h" +#include "tuxonice_bio_internal.h" + +struct sig_data *toi_sig_data; + +/* Struct of swap header pages */ + +struct old_sig_data { + dev_t device; + unsigned long sector; + int resume_attempted; + int orig_sig_type; +}; + +union diskpage { + union swap_header swh; /* swh.magic is the only member used */ + struct sig_data sig_data; + struct old_sig_data old_sig_data; +}; + +union p_diskpage { + union diskpage *pointer; + char *ptr; + unsigned long address; +}; + +char *toi_cur_sig_page; +char *toi_orig_sig_page; +int have_image; +int have_old_image; + +int get_signature_page(void) +{ + if (!toi_cur_sig_page) { + toi_message(TOI_IO, TOI_VERBOSE, 0, + "Allocating current signature page."); + toi_cur_sig_page = (char *) toi_get_zeroed_page(38, + TOI_ATOMIC_GFP); + if (!toi_cur_sig_page) { + printk(KERN_ERR "Failed to allocate memory for the " + "current image signature.\n"); + return -ENOMEM; + } + + toi_sig_data = (struct sig_data *) toi_cur_sig_page; + } + + toi_message(TOI_IO, TOI_VERBOSE, 0, "Reading signature from dev %lx," + " sector %d.", + resume_block_device->bd_dev, resume_firstblock); + + return toi_bio_ops.bdev_page_io(READ, resume_block_device, + resume_firstblock, virt_to_page(toi_cur_sig_page)); +} + +void forget_signature_page(void) +{ + if (toi_cur_sig_page) { + toi_sig_data = NULL; + toi_message(TOI_IO, TOI_VERBOSE, 0, "Freeing toi_cur_sig_page" + " (%p).", toi_cur_sig_page); + toi_free_page(38, (unsigned long) toi_cur_sig_page); + toi_cur_sig_page = NULL; + } + + if (toi_orig_sig_page) { + toi_message(TOI_IO, TOI_VERBOSE, 0, "Freeing toi_orig_sig_page" + " (%p).", toi_orig_sig_page); + toi_free_page(38, (unsigned long) toi_orig_sig_page); + toi_orig_sig_page = NULL; + } +} + +/* + * We need to ensure we use the signature page that's currently on disk, + * so as to not remove the image header. Post-atomic-restore, the orig sig + * page will be empty, so we can use that as our method of knowing that we + * need to load the on-disk signature and not use the non-image sig in + * memory. (We're going to powerdown after writing the change, so it's safe. + */ +int toi_bio_mark_resume_attempted(int flag) +{ + toi_message(TOI_IO, TOI_VERBOSE, 0, "Make resume attempted = %d.", + flag); + if (!toi_orig_sig_page) { + forget_signature_page(); + get_signature_page(); + } + toi_sig_data->resumed_before = flag; + return toi_bio_ops.bdev_page_io(WRITE, resume_block_device, + resume_firstblock, virt_to_page(toi_cur_sig_page)); +} + +int toi_bio_mark_have_image(void) +{ + int result = 0; + char buf[32]; + struct fs_info *fs_info; + + toi_message(TOI_IO, TOI_VERBOSE, 0, "Recording that an image exists."); + memcpy(toi_sig_data->sig, tuxonice_signature, + sizeof(tuxonice_signature)); + toi_sig_data->have_image = 1; + toi_sig_data->resumed_before = 0; + toi_sig_data->header_dev_t = get_header_dev_t(); + toi_sig_data->have_uuid = 0; + + fs_info = fs_info_from_block_dev(get_header_bdev()); + if (fs_info && !IS_ERR(fs_info)) { + memcpy(toi_sig_data->header_uuid, &fs_info->uuid, 16); + free_fs_info(fs_info); + } else + result = (int) PTR_ERR(fs_info); + + if (!result) { + toi_message(TOI_IO, TOI_VERBOSE, 0, "Got uuid for dev_t %s.", + format_dev_t(buf, get_header_dev_t())); + toi_sig_data->have_uuid = 1; + } else + toi_message(TOI_IO, TOI_VERBOSE, 0, "Could not get uuid for " + "dev_t %s.", + format_dev_t(buf, get_header_dev_t())); + + toi_sig_data->first_header_block = get_headerblock(); + have_image = 1; + toi_message(TOI_IO, TOI_VERBOSE, 0, "header dev_t is %x. First block " + "is %d.", toi_sig_data->header_dev_t, + toi_sig_data->first_header_block); + + memcpy(toi_sig_data->sig2, tuxonice_signature, + sizeof(tuxonice_signature)); + toi_sig_data->header_version = TOI_HEADER_VERSION; + + return toi_bio_ops.bdev_page_io(WRITE, resume_block_device, + resume_firstblock, virt_to_page(toi_cur_sig_page)); +} + +int remove_old_signature(void) +{ + union p_diskpage swap_header_page = (union p_diskpage) toi_cur_sig_page; + char *orig_sig; + char *header_start = (char *) toi_get_zeroed_page(38, TOI_ATOMIC_GFP); + int result; + struct block_device *header_bdev; + struct old_sig_data *old_sig_data = + &swap_header_page.pointer->old_sig_data; + + header_bdev = toi_open_bdev(NULL, old_sig_data->device, 1); + result = toi_bio_ops.bdev_page_io(READ, header_bdev, + old_sig_data->sector, virt_to_page(header_start)); + + if (result) + goto out; + + /* + * TODO: Get the original contents of the first bytes of the swap + * header page. + */ + if (!old_sig_data->orig_sig_type) + orig_sig = "SWAP-SPACE"; + else + orig_sig = "SWAPSPACE2"; + + memcpy(swap_header_page.pointer->swh.magic.magic, orig_sig, 10); + memcpy(swap_header_page.ptr, header_start, 10); + + result = toi_bio_ops.bdev_page_io(WRITE, resume_block_device, + resume_firstblock, virt_to_page(swap_header_page.ptr)); + +out: + toi_close_bdev(header_bdev); + have_old_image = 0; + toi_free_page(38, (unsigned long) header_start); + return result; +} + +/* + * toi_bio_restore_original_signature - restore the original signature + * + * At boot time (aborting pre atomic-restore), toi_orig_sig_page gets used. + * It will have the original signature page contents, stored in the image + * header. Post atomic-restore, we use :toi_cur_sig_page, which will contain + * the contents that were loaded when we started the cycle. + */ +int toi_bio_restore_original_signature(void) +{ + char *use = toi_orig_sig_page ? toi_orig_sig_page : toi_cur_sig_page; + + if (have_old_image) + return remove_old_signature(); + + if (!use) { + printk("toi_bio_restore_original_signature: No signature " + "page loaded.\n"); + return 0; + } + + toi_message(TOI_IO, TOI_VERBOSE, 0, "Recording that no image exists."); + have_image = 0; + toi_sig_data->have_image = 0; + return toi_bio_ops.bdev_page_io(WRITE, resume_block_device, + resume_firstblock, virt_to_page(use)); +} + +/* + * check_for_signature - See whether we have an image. + * + * Returns 0 if no image, 1 if there is one, -1 if indeterminate. + */ +int toi_check_for_signature(void) +{ + union p_diskpage swap_header_page; + int type; + const char *normal_sigs[] = {"SWAP-SPACE", "SWAPSPACE2" }; + const char *swsusp_sigs[] = {"S1SUSP", "S2SUSP", "S1SUSPEND" }; + char *swap_header; + + if (!toi_cur_sig_page) { + int result = get_signature_page(); + + if (result) + return result; + } + + /* + * Start by looking for the binary header. + */ + if (!memcmp(tuxonice_signature, toi_cur_sig_page, + sizeof(tuxonice_signature))) { + have_image = toi_sig_data->have_image; + toi_message(TOI_IO, TOI_VERBOSE, 0, "Have binary signature. " + "Have image is %d.", have_image); + if (have_image) + toi_message(TOI_IO, TOI_VERBOSE, 0, "header dev_t is " + "%x. First block is %d.", + toi_sig_data->header_dev_t, + toi_sig_data->first_header_block); + return toi_sig_data->have_image; + } + + /* + * Failing that, try old file allocator headers. + */ + + if (!memcmp(HaveImage, toi_cur_sig_page, strlen(HaveImage))) { + have_image = 1; + return 1; + } + + have_image = 0; + + if (!memcmp(NoImage, toi_cur_sig_page, strlen(NoImage))) + return 0; + + /* + * Nope? How about swap? + */ + swap_header_page = (union p_diskpage) toi_cur_sig_page; + swap_header = swap_header_page.pointer->swh.magic.magic; + + /* Normal swapspace? */ + for (type = 0; type < 2; type++) + if (!memcmp(normal_sigs[type], swap_header, + strlen(normal_sigs[type]))) + return 0; + + /* Swsusp or uswsusp? */ + for (type = 0; type < 3; type++) + if (!memcmp(swsusp_sigs[type], swap_header, + strlen(swsusp_sigs[type]))) + return 2; + + /* Old TuxOnIce version? */ + if (!memcmp(tuxonice_signature, swap_header, + sizeof(tuxonice_signature) - 1)) { + toi_message(TOI_IO, TOI_VERBOSE, 0, "Found old TuxOnIce " + "signature."); + have_old_image = 1; + return 3; + } + + return -1; +} + +/* + * Image_exists + * + * Returns -1 if don't know, otherwise 0 (no) or 1 (yes). + */ +int toi_bio_image_exists(int quiet) +{ + int result; + char *msg = NULL; + + toi_message(TOI_IO, TOI_VERBOSE, 0, "toi_bio_image_exists."); + + if (!resume_dev_t) { + if (!quiet) + printk(KERN_INFO "Not even trying to read header " + "because resume_dev_t is not set.\n"); + return -1; + } + + if (open_resume_dev_t(0, quiet)) + return -1; + + result = toi_check_for_signature(); + + clear_toi_state(TOI_RESUMED_BEFORE); + if (toi_sig_data->resumed_before) + set_toi_state(TOI_RESUMED_BEFORE); + + if (quiet || result == -ENOMEM) + return result; + + if (result == -1) + msg = "TuxOnIce: Unable to find a signature." + " Could you have moved a swap file?\n"; + else if (!result) + msg = "TuxOnIce: No image found.\n"; + else if (result == 1) + msg = "TuxOnIce: Image found.\n"; + else if (result == 2) + msg = "TuxOnIce: uswsusp or swsusp image found.\n"; + else if (result == 3) + msg = "TuxOnIce: Old implementation's signature found.\n"; + + printk(KERN_INFO "%s", msg); + + return result; +} + +int toi_bio_scan_for_image(int quiet) +{ + struct block_device *bdev; + char default_name[255] = ""; + + if (!quiet) + printk(KERN_DEBUG "Scanning swap devices for TuxOnIce " + "signature...\n"); + for (bdev = next_bdev_of_type(NULL, "swap"); bdev; + bdev = next_bdev_of_type(bdev, "swap")) { + int result; + char name[255] = ""; + sprintf(name, "%u:%u", MAJOR(bdev->bd_dev), + MINOR(bdev->bd_dev)); + if (!quiet) + printk(KERN_DEBUG "- Trying %s.\n", name); + resume_block_device = bdev; + resume_dev_t = bdev->bd_dev; + + result = toi_check_for_signature(); + + resume_block_device = NULL; + resume_dev_t = MKDEV(0, 0); + + if (!default_name[0]) + strcpy(default_name, name); + + if (result == 1) { + /* Got one! */ + strcpy(resume_file, name); + next_bdev_of_type(bdev, NULL); + if (!quiet) + printk(KERN_DEBUG " ==> Image found on %s.\n", + resume_file); + return 1; + } + forget_signature_page(); + } + + if (!quiet) + printk(KERN_DEBUG "TuxOnIce scan: No image found.\n"); + strcpy(resume_file, default_name); + return 0; +} + +int toi_bio_get_header_version(void) +{ + return (memcmp(toi_sig_data->sig2, tuxonice_signature, + sizeof(tuxonice_signature))) ? + 0 : toi_sig_data->header_version; + +} diff --git b/kernel/power/tuxonice_builtin.c b/kernel/power/tuxonice_builtin.c new file mode 100644 index 0000000..22bf07a --- /dev/null +++ b/kernel/power/tuxonice_builtin.c @@ -0,0 +1,498 @@ +/* + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "tuxonice_io.h" +#include "tuxonice.h" +#include "tuxonice_extent.h" +#include "tuxonice_netlink.h" +#include "tuxonice_prepare_image.h" +#include "tuxonice_ui.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_pagedir.h" +#include "tuxonice_modules.h" +#include "tuxonice_builtin.h" +#include "tuxonice_power_off.h" +#include "tuxonice_alloc.h" + +unsigned long toi_bootflags_mask; + +/* + * Highmem related functions (x86 only). + */ + +#ifdef CONFIG_HIGHMEM + +/** + * copyback_high: Restore highmem pages. + * + * Highmem data and pbe lists are/can be stored in highmem. + * The format is slightly different to the lowmem pbe lists + * used for the assembly code: the last pbe in each page is + * a struct page * instead of struct pbe *, pointing to the + * next page where pbes are stored (or NULL if happens to be + * the end of the list). Since we don't want to generate + * unnecessary deltas against swsusp code, we use a cast + * instead of a union. + **/ + +static void copyback_high(void) +{ + struct page *pbe_page = (struct page *) restore_highmem_pblist; + struct pbe *this_pbe, *first_pbe; + unsigned long *origpage, *copypage; + int pbe_index = 1; + + if (!pbe_page) + return; + + this_pbe = (struct pbe *) kmap_atomic(pbe_page); + first_pbe = this_pbe; + + while (this_pbe) { + int loop = (PAGE_SIZE / sizeof(unsigned long)) - 1; + + origpage = kmap_atomic(pfn_to_page((unsigned long) this_pbe->orig_address)); + copypage = kmap_atomic((struct page *) this_pbe->address); + + while (loop >= 0) { + *(origpage + loop) = *(copypage + loop); + loop--; + } + + kunmap_atomic(origpage); + kunmap_atomic(copypage); + + if (!this_pbe->next) + break; + + if (pbe_index < PBES_PER_PAGE) { + this_pbe++; + pbe_index++; + } else { + pbe_page = (struct page *) this_pbe->next; + kunmap_atomic(first_pbe); + if (!pbe_page) + return; + this_pbe = (struct pbe *) kmap_atomic(pbe_page); + first_pbe = this_pbe; + pbe_index = 1; + } + } + kunmap_atomic(first_pbe); +} + +#else /* CONFIG_HIGHMEM */ +static void copyback_high(void) { } +#endif + +char toi_wait_for_keypress_dev_console(int timeout) +{ + int fd, this_timeout = 255, orig_kthread = 0; + char key = '\0'; + struct termios t, t_backup; + + /* We should be guaranteed /dev/console exists after populate_rootfs() + * in init/main.c. + */ + fd = sys_open("/dev/console", O_RDONLY, 0); + if (fd < 0) { + printk(KERN_INFO "Couldn't open /dev/console.\n"); + return key; + } + + if (sys_ioctl(fd, TCGETS, (long)&t) < 0) + goto out_close; + + memcpy(&t_backup, &t, sizeof(t)); + + t.c_lflag &= ~(ISIG|ICANON|ECHO); + t.c_cc[VMIN] = 0; + +new_timeout: + if (timeout > 0) { + this_timeout = timeout < 26 ? timeout : 25; + timeout -= this_timeout; + this_timeout *= 10; + } + + t.c_cc[VTIME] = this_timeout; + + if (sys_ioctl(fd, TCSETS, (long)&t) < 0) + goto out_restore; + + if (current->flags & PF_KTHREAD) { + orig_kthread = (current->flags & PF_KTHREAD); + current->flags &= ~PF_KTHREAD; + } + + while (1) { + if (sys_read(fd, &key, 1) <= 0) { + if (timeout) + goto new_timeout; + key = '\0'; + break; + } + key = tolower(key); + if (test_toi_state(TOI_SANITY_CHECK_PROMPT)) { + if (key == 'c') { + set_toi_state(TOI_CONTINUE_REQ); + break; + } else if (key == ' ') + break; + } else + break; + } + if (orig_kthread) { + current->flags |= PF_KTHREAD; + } + +out_restore: + sys_ioctl(fd, TCSETS, (long)&t_backup); +out_close: + sys_close(fd); + + return key; +} + +struct toi_boot_kernel_data toi_bkd __nosavedata + __attribute__((aligned(PAGE_SIZE))) = { + MY_BOOT_KERNEL_DATA_VERSION, + 0, +#ifdef CONFIG_TOI_REPLACE_SWSUSP + (1 << TOI_REPLACE_SWSUSP) | +#endif + (1 << TOI_NO_FLUSHER_THREAD) | + (1 << TOI_PAGESET2_FULL), +}; + +struct block_device *toi_open_by_devnum(dev_t dev) +{ + struct block_device *bdev = bdget(dev); + int err = -ENOMEM; + if (bdev) + err = blkdev_get(bdev, FMODE_READ | FMODE_NDELAY, NULL); + return err ? ERR_PTR(err) : bdev; +} + +/** + * toi_close_bdev: Close a swap bdev. + * + * int: The swap entry number to close. + */ +void toi_close_bdev(struct block_device *bdev) +{ + blkdev_put(bdev, FMODE_READ | FMODE_NDELAY); +} + +int toi_wait = CONFIG_TOI_DEFAULT_WAIT; +struct toi_core_fns *toi_core_fns; +unsigned long toi_result; +struct pagedir pagedir1 = {1}; +struct toi_cbw **toi_first_cbw; +int toi_next_cbw; + +unsigned long toi_get_nonconflicting_page(void) +{ + return toi_core_fns->get_nonconflicting_page(); +} + +int toi_post_context_save(void) +{ + return toi_core_fns->post_context_save(); +} + +int try_tuxonice_hibernate(void) +{ + if (!toi_core_fns) + return -ENODEV; + + return toi_core_fns->try_hibernate(); +} + +static int num_resume_calls; +#ifdef CONFIG_TOI_IGNORE_LATE_INITCALL +static int ignore_late_initcall = 1; +#else +static int ignore_late_initcall; +#endif + +int toi_translate_err_default = TOI_CONTINUE_REQ; + +void try_tuxonice_resume(void) +{ + if (!hibernation_available()) + return; + + /* Don't let it wrap around eventually */ + if (num_resume_calls < 2) + num_resume_calls++; + + if (num_resume_calls == 1 && ignore_late_initcall) { + printk(KERN_INFO "TuxOnIce: Ignoring late initcall, as requested.\n"); + return; + } + + if (toi_core_fns) + toi_core_fns->try_resume(); + else + printk(KERN_INFO "TuxOnIce core not loaded yet.\n"); +} + +int toi_lowlevel_builtin(void) +{ + int error = 0; + + save_processor_state(); + error = swsusp_arch_suspend(); + if (error) + printk(KERN_ERR "Error %d hibernating\n", error); + + /* Restore control flow appears here */ + if (!toi_in_hibernate) { + copyback_high(); + set_toi_state(TOI_NOW_RESUMING); + } + + restore_processor_state(); + return error; +} + +unsigned long toi_compress_bytes_in; +unsigned long toi_compress_bytes_out; + +int toi_in_suspend(void) +{ + return in_suspend; +} + +unsigned long toi_state = ((1 << TOI_BOOT_TIME) | + (1 << TOI_IGNORE_LOGLEVEL) | + (1 << TOI_IO_STOPPED)); + +/* The number of hibernates we have started (some may have been cancelled) */ +unsigned int nr_hibernates; +int toi_running; +__nosavedata int toi_in_hibernate; +__nosavedata struct pbe *restore_highmem_pblist; + +int toi_trace_allocs; + +void toi_read_lock_tasklist(void) +{ + read_lock(&tasklist_lock); +} + +void toi_read_unlock_tasklist(void) +{ + read_unlock(&tasklist_lock); +} + +#ifdef CONFIG_TOI_ZRAM_SUPPORT +int (*toi_flag_zram_disks) (void); + +int toi_do_flag_zram_disks(void) +{ + return toi_flag_zram_disks ? (*toi_flag_zram_disks)() : 0; +} + +#endif + +/* toi_generate_free_page_map + * + * Description: This routine generates a bitmap of free pages from the + * lists used by the memory manager. We then use the bitmap + * to quickly calculate which pages to save and in which + * pagesets. + */ +void toi_generate_free_page_map(void) +{ + int order, cpu, t; + unsigned long flags, i; + struct zone *zone; + struct list_head *curr; + unsigned long pfn; + struct page *page; + + for_each_populated_zone(zone) { + + if (!zone->spanned_pages) + continue; + + spin_lock_irqsave(&zone->lock, flags); + + for (i = 0; i < zone->spanned_pages; i++) { + pfn = zone->zone_start_pfn + i; + + if (!pfn_valid(pfn)) + continue; + + page = pfn_to_page(pfn); + + ClearPageNosaveFree(page); + } + + for_each_migratetype_order(order, t) { + list_for_each(curr, + &zone->free_area[order].free_list[t]) { + unsigned long j; + + pfn = page_to_pfn(list_entry(curr, struct page, + lru)); + for (j = 0; j < (1UL << order); j++) + SetPageNosaveFree(pfn_to_page(pfn + j)); + } + } + + for_each_online_cpu(cpu) { + struct per_cpu_pageset *pset = + per_cpu_ptr(zone->pageset, cpu); + struct per_cpu_pages *pcp = &pset->pcp; + struct page *page; + int t; + + for (t = 0; t < MIGRATE_PCPTYPES; t++) + list_for_each_entry(page, &pcp->lists[t], lru) + SetPageNosaveFree(page); + } + + spin_unlock_irqrestore(&zone->lock, flags); + } +} + +/* toi_size_of_free_region + * + * Description: Return the number of pages that are free, beginning with and + * including this one. + */ +int toi_size_of_free_region(struct zone *zone, unsigned long start_pfn) +{ + unsigned long this_pfn = start_pfn, + end_pfn = zone_end_pfn(zone); + + while (pfn_valid(this_pfn) && this_pfn < end_pfn && PageNosaveFree(pfn_to_page(this_pfn))) + this_pfn++; + + return this_pfn - start_pfn; +} + +static int __init toi_wait_setup(char *str) +{ + int value; + + if (sscanf(str, "=%d", &value)) { + if (value < -1 || value > 255) + printk(KERN_INFO "TuxOnIce_wait outside range -1 to " + "255.\n"); + else + toi_wait = value; + } + + return 1; +} +__setup("toi_wait", toi_wait_setup); + +static int __init toi_translate_retry_setup(char *str) +{ + toi_translate_err_default = 0; + return 1; +} +__setup("toi_translate_retry", toi_translate_retry_setup); + +static int __init toi_debug_setup(char *str) +{ + toi_bkd.toi_action |= (1 << TOI_LOGALL); + toi_bootflags_mask |= (1 << TOI_LOGALL); + toi_bkd.toi_debug_state = 255; + toi_bkd.toi_default_console_level = 7; + return 1; +} +__setup("toi_debug_setup", toi_debug_setup); + +static int __init toi_pause_setup(char *str) +{ + toi_bkd.toi_action |= (1 << TOI_PAUSE); + toi_bootflags_mask |= (1 << TOI_PAUSE); + return 1; +} +__setup("toi_pause", toi_pause_setup); + +#ifdef CONFIG_PM_DEBUG +static int __init toi_trace_allocs_setup(char *str) +{ + int value; + + if (sscanf(str, "=%d", &value)) + toi_trace_allocs = value; + + return 1; +} +__setup("toi_trace_allocs", toi_trace_allocs_setup); +#endif + +static int __init toi_ignore_late_initcall_setup(char *str) +{ + int value; + + if (sscanf(str, "=%d", &value)) + ignore_late_initcall = value; + + return 1; +} +__setup("toi_initramfs_resume_only", toi_ignore_late_initcall_setup); + +static int __init toi_force_no_multithreaded_setup(char *str) +{ + int value; + + toi_bkd.toi_action &= ~(1 << TOI_NO_MULTITHREADED_IO); + toi_bootflags_mask |= (1 << TOI_NO_MULTITHREADED_IO); + + if (sscanf(str, "=%d", &value) && value) + toi_bkd.toi_action |= (1 << TOI_NO_MULTITHREADED_IO); + + return 1; +} +__setup("toi_no_multithreaded", toi_force_no_multithreaded_setup); + +#ifdef CONFIG_KGDB +static int __init toi_post_resume_breakpoint_setup(char *str) +{ + int value; + + toi_bkd.toi_action &= ~(1 << TOI_POST_RESUME_BREAKPOINT); + toi_bootflags_mask |= (1 << TOI_POST_RESUME_BREAKPOINT); + if (sscanf(str, "=%d", &value) && value) + toi_bkd.toi_action |= (1 << TOI_POST_RESUME_BREAKPOINT); + + return 1; +} +__setup("toi_post_resume_break", toi_post_resume_breakpoint_setup); +#endif + +static int __init toi_disable_readahead_setup(char *str) +{ + int value; + + toi_bkd.toi_action &= ~(1 << TOI_NO_READAHEAD); + toi_bootflags_mask |= (1 << TOI_NO_READAHEAD); + if (sscanf(str, "=%d", &value) && value) + toi_bkd.toi_action |= (1 << TOI_NO_READAHEAD); + + return 1; +} +__setup("toi_no_readahead", toi_disable_readahead_setup); diff --git b/kernel/power/tuxonice_builtin.h b/kernel/power/tuxonice_builtin.h new file mode 100644 index 0000000..9539818 --- /dev/null +++ b/kernel/power/tuxonice_builtin.h @@ -0,0 +1,41 @@ +/* + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + */ +#include + +extern struct toi_core_fns *toi_core_fns; +extern unsigned long toi_compress_bytes_in, toi_compress_bytes_out; +extern unsigned int nr_hibernates; +extern int toi_in_hibernate; + +extern __nosavedata struct pbe *restore_highmem_pblist; + +int toi_lowlevel_builtin(void); + +#ifdef CONFIG_HIGHMEM +extern __nosavedata struct zone_data *toi_nosave_zone_list; +extern __nosavedata unsigned long toi_nosave_max_pfn; +#endif + +extern unsigned long toi_get_nonconflicting_page(void); +extern int toi_post_context_save(void); + +extern char toi_wait_for_keypress_dev_console(int timeout); +extern struct block_device *toi_open_by_devnum(dev_t dev); +extern void toi_close_bdev(struct block_device *bdev); +extern int toi_wait; +extern int toi_translate_err_default; +extern int toi_force_no_multithreaded; +extern void toi_read_lock_tasklist(void); +extern void toi_read_unlock_tasklist(void); +extern int toi_in_suspend(void); +extern void toi_generate_free_page_map(void); +extern int toi_size_of_free_region(struct zone *zone, unsigned long start_pfn); + +#ifdef CONFIG_TOI_ZRAM_SUPPORT +extern int toi_do_flag_zram_disks(void); +#else +#define toi_do_flag_zram_disks() (0) +#endif diff --git b/kernel/power/tuxonice_checksum.c b/kernel/power/tuxonice_checksum.c new file mode 100644 index 0000000..1c4e10c --- /dev/null +++ b/kernel/power/tuxonice_checksum.c @@ -0,0 +1,392 @@ +/* + * kernel/power/tuxonice_checksum.c + * + * Copyright (C) 2006-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * This file contains data checksum routines for TuxOnIce, + * using cryptoapi. They are used to locate any modifications + * made to pageset 2 while we're saving it. + */ + +#include +#include +#include +#include +#include + +#include "tuxonice.h" +#include "tuxonice_modules.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_io.h" +#include "tuxonice_pageflags.h" +#include "tuxonice_checksum.h" +#include "tuxonice_pagedir.h" +#include "tuxonice_alloc.h" +#include "tuxonice_ui.h" + +static struct toi_module_ops toi_checksum_ops; + +/* Constant at the mo, but I might allow tuning later */ +static char toi_checksum_name[32] = "md4"; +/* Bytes per checksum */ +#define CHECKSUM_SIZE (16) + +#define CHECKSUMS_PER_PAGE ((PAGE_SIZE - sizeof(void *)) / CHECKSUM_SIZE) + +struct cpu_context { + struct crypto_hash *transform; + struct hash_desc desc; + struct scatterlist sg[2]; + char *buf; +}; + +static DEFINE_PER_CPU(struct cpu_context, contexts); +static int pages_allocated; +static unsigned long page_list; + +static int toi_num_resaved; + +static unsigned long this_checksum, next_page; +static int checksum_count; + +static inline int checksum_pages_needed(void) +{ + return DIV_ROUND_UP(pagedir2.size, CHECKSUMS_PER_PAGE); +} + +/* ---- Local buffer management ---- */ + +/* + * toi_checksum_cleanup + * + * Frees memory allocated for our labours. + */ +static void toi_checksum_cleanup(int ending_cycle) +{ + int cpu; + + if (ending_cycle) { + for_each_online_cpu(cpu) { + struct cpu_context *this = &per_cpu(contexts, cpu); + if (this->transform) { + crypto_free_hash(this->transform); + this->transform = NULL; + this->desc.tfm = NULL; + } + + if (this->buf) { + toi_free_page(27, (unsigned long) this->buf); + this->buf = NULL; + } + } + } +} + +/* + * toi_crypto_initialise + * + * Prepare to do some work by allocating buffers and transforms. + * Returns: Int: Zero. Even if we can't set up checksum, we still + * seek to hibernate. + */ +static int toi_checksum_initialise(int starting_cycle) +{ + int cpu; + + if (!(starting_cycle & SYSFS_HIBERNATE) || !toi_checksum_ops.enabled) + return 0; + + if (!*toi_checksum_name) { + printk(KERN_INFO "TuxOnIce: No checksum algorithm name set.\n"); + return 1; + } + + for_each_online_cpu(cpu) { + struct cpu_context *this = &per_cpu(contexts, cpu); + struct page *page; + + this->transform = crypto_alloc_hash(toi_checksum_name, 0, 0); + if (IS_ERR(this->transform)) { + printk(KERN_INFO "TuxOnIce: Failed to initialise the " + "%s checksum algorithm: %ld.\n", + toi_checksum_name, (long) this->transform); + this->transform = NULL; + return 1; + } + + this->desc.tfm = this->transform; + this->desc.flags = 0; + + page = toi_alloc_page(27, GFP_KERNEL); + if (!page) + return 1; + this->buf = page_address(page); + sg_init_one(&this->sg[0], this->buf, PAGE_SIZE); + } + return 0; +} + +/* + * toi_checksum_print_debug_stats + * @buffer: Pointer to a buffer into which the debug info will be printed. + * @size: Size of the buffer. + * + * Print information to be recorded for debugging purposes into a buffer. + * Returns: Number of characters written to the buffer. + */ + +static int toi_checksum_print_debug_stats(char *buffer, int size) +{ + int len; + + if (!toi_checksum_ops.enabled) + return scnprintf(buffer, size, + "- Checksumming disabled.\n"); + + len = scnprintf(buffer, size, "- Checksum method is '%s'.\n", + toi_checksum_name); + len += scnprintf(buffer + len, size - len, + " %d pages resaved in atomic copy.\n", toi_num_resaved); + return len; +} + +static int toi_checksum_memory_needed(void) +{ + return toi_checksum_ops.enabled ? + checksum_pages_needed() << PAGE_SHIFT : 0; +} + +static int toi_checksum_storage_needed(void) +{ + if (toi_checksum_ops.enabled) + return strlen(toi_checksum_name) + sizeof(int) + 1; + else + return 0; +} + +/* + * toi_checksum_save_config_info + * @buffer: Pointer to a buffer of size PAGE_SIZE. + * + * Save informaton needed when reloading the image at resume time. + * Returns: Number of bytes used for saving our data. + */ +static int toi_checksum_save_config_info(char *buffer) +{ + int namelen = strlen(toi_checksum_name) + 1; + int total_len; + + *((unsigned int *) buffer) = namelen; + strncpy(buffer + sizeof(unsigned int), toi_checksum_name, namelen); + total_len = sizeof(unsigned int) + namelen; + return total_len; +} + +/* toi_checksum_load_config_info + * @buffer: Pointer to the start of the data. + * @size: Number of bytes that were saved. + * + * Description: Reload information needed for dechecksuming the image at + * resume time. + */ +static void toi_checksum_load_config_info(char *buffer, int size) +{ + int namelen; + + namelen = *((unsigned int *) (buffer)); + strncpy(toi_checksum_name, buffer + sizeof(unsigned int), + namelen); + return; +} + +/* + * Free Checksum Memory + */ + +void free_checksum_pages(void) +{ + while (pages_allocated) { + unsigned long next = *((unsigned long *) page_list); + ClearPageNosave(virt_to_page(page_list)); + toi_free_page(15, (unsigned long) page_list); + page_list = next; + pages_allocated--; + } +} + +/* + * Allocate Checksum Memory + */ + +int allocate_checksum_pages(void) +{ + int pages_needed = checksum_pages_needed(); + + if (!toi_checksum_ops.enabled) + return 0; + + while (pages_allocated < pages_needed) { + unsigned long *new_page = + (unsigned long *) toi_get_zeroed_page(15, TOI_ATOMIC_GFP); + if (!new_page) { + printk(KERN_ERR "Unable to allocate checksum pages.\n"); + return -ENOMEM; + } + SetPageNosave(virt_to_page(new_page)); + (*new_page) = page_list; + page_list = (unsigned long) new_page; + pages_allocated++; + } + + next_page = (unsigned long) page_list; + checksum_count = 0; + + return 0; +} + +char *tuxonice_get_next_checksum(void) +{ + if (!toi_checksum_ops.enabled) + return NULL; + + if (checksum_count % CHECKSUMS_PER_PAGE) + this_checksum += CHECKSUM_SIZE; + else { + this_checksum = next_page + sizeof(void *); + next_page = *((unsigned long *) next_page); + } + + checksum_count++; + return (char *) this_checksum; +} + +int tuxonice_calc_checksum(struct page *page, char *checksum_locn) +{ + char *pa; + int result, cpu = smp_processor_id(); + struct cpu_context *ctx = &per_cpu(contexts, cpu); + + if (!toi_checksum_ops.enabled) + return 0; + + pa = kmap(page); + memcpy(ctx->buf, pa, PAGE_SIZE); + kunmap(page); + result = crypto_hash_digest(&ctx->desc, ctx->sg, PAGE_SIZE, + checksum_locn); + if (result) + printk(KERN_ERR "TuxOnIce checksumming: crypto_hash_digest " + "returned %d.\n", result); + return result; +} +/* + * Calculate checksums + */ + +void check_checksums(void) +{ + int index = 0, cpu = smp_processor_id(); + char current_checksum[CHECKSUM_SIZE]; + struct cpu_context *ctx = &per_cpu(contexts, cpu); + unsigned long pfn; + + if (!toi_checksum_ops.enabled) { + toi_message(TOI_IO, TOI_VERBOSE, 0, "Checksumming disabled."); + return; + } + + next_page = (unsigned long) page_list; + + toi_num_resaved = 0; + this_checksum = 0; + + toi_trace_index++; + + toi_message(TOI_IO, TOI_VERBOSE, 0, "Verifying checksums."); + memory_bm_position_reset(pageset2_map); + for (pfn = memory_bm_next_pfn(pageset2_map, 0); pfn != BM_END_OF_MAP; + pfn = memory_bm_next_pfn(pageset2_map, 0)) { + int ret, resave_needed = false; + char *pa; + struct page *page = pfn_to_page(pfn); + + if (index < checksum_count) { + if (index % CHECKSUMS_PER_PAGE) { + this_checksum += CHECKSUM_SIZE; + } else { + this_checksum = next_page + sizeof(void *); + next_page = *((unsigned long *) next_page); + } + + /* Done when IRQs disabled so must be atomic */ + pa = kmap_atomic(page); + memcpy(ctx->buf, pa, PAGE_SIZE); + kunmap_atomic(pa); + ret = crypto_hash_digest(&ctx->desc, ctx->sg, PAGE_SIZE, + current_checksum); + + if (ret) { + printk(KERN_INFO "Digest failed. Returned %d.\n", ret); + return; + } + + resave_needed = memcmp(current_checksum, (char *) this_checksum, + CHECKSUM_SIZE); + } else { + resave_needed = true; + } + + if (resave_needed) { + TOI_TRACE_DEBUG(pfn, "_Resaving %d", resave_needed); + SetPageResave(pfn_to_page(pfn)); + toi_num_resaved++; + if (test_action_state(TOI_ABORT_ON_RESAVE_NEEDED)) + set_abort_result(TOI_RESAVE_NEEDED); + } + + index++; + } + toi_message(TOI_IO, TOI_VERBOSE, 0, "Checksum verification complete."); +} + +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_INT("enabled", SYSFS_RW, &toi_checksum_ops.enabled, 0, 1, 0, + NULL), + SYSFS_BIT("abort_if_resave_needed", SYSFS_RW, &toi_bkd.toi_action, + TOI_ABORT_ON_RESAVE_NEEDED, 0) +}; + +/* + * Ops structure. + */ +static struct toi_module_ops toi_checksum_ops = { + .type = MISC_MODULE, + .name = "checksumming", + .directory = "checksum", + .module = THIS_MODULE, + .initialise = toi_checksum_initialise, + .cleanup = toi_checksum_cleanup, + .print_debug_info = toi_checksum_print_debug_stats, + .save_config_info = toi_checksum_save_config_info, + .load_config_info = toi_checksum_load_config_info, + .memory_needed = toi_checksum_memory_needed, + .storage_needed = toi_checksum_storage_needed, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +/* ---- Registration ---- */ +int toi_checksum_init(void) +{ + int result = toi_register_module(&toi_checksum_ops); + return result; +} + +void toi_checksum_exit(void) +{ + toi_unregister_module(&toi_checksum_ops); +} diff --git b/kernel/power/tuxonice_checksum.h b/kernel/power/tuxonice_checksum.h new file mode 100644 index 0000000..c8196fb --- /dev/null +++ b/kernel/power/tuxonice_checksum.h @@ -0,0 +1,31 @@ +/* + * kernel/power/tuxonice_checksum.h + * + * Copyright (C) 2006-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * This file contains data checksum routines for TuxOnIce, + * using cryptoapi. They are used to locate any modifications + * made to pageset 2 while we're saving it. + */ + +#if defined(CONFIG_TOI_CHECKSUM) +extern int toi_checksum_init(void); +extern void toi_checksum_exit(void); +void check_checksums(void); +int allocate_checksum_pages(void); +void free_checksum_pages(void); +char *tuxonice_get_next_checksum(void); +int tuxonice_calc_checksum(struct page *page, char *checksum_locn); +#else +static inline int toi_checksum_init(void) { return 0; } +static inline void toi_checksum_exit(void) { } +static inline void check_checksums(void) { }; +static inline int allocate_checksum_pages(void) { return 0; }; +static inline void free_checksum_pages(void) { }; +static inline char *tuxonice_get_next_checksum(void) { return NULL; }; +static inline int tuxonice_calc_checksum(struct page *page, char *checksum_locn) + { return 0; } +#endif + diff --git b/kernel/power/tuxonice_cluster.c b/kernel/power/tuxonice_cluster.c new file mode 100644 index 0000000..2873f93 --- /dev/null +++ b/kernel/power/tuxonice_cluster.c @@ -0,0 +1,1058 @@ +/* + * kernel/power/tuxonice_cluster.c + * + * Copyright (C) 2006-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * This file contains routines for cluster hibernation support. + * + * Based on ip autoconfiguration code in net/ipv4/ipconfig.c. + * + * How does it work? + * + * There is no 'master' node that tells everyone else what to do. All nodes + * send messages to the broadcast address/port, maintain a list of peers + * and figure out when to progress to the next step in hibernating or resuming. + * This makes us more fault tolerant when it comes to nodes coming and going + * (which may be more of an issue if we're hibernating when power supplies + * are being unreliable). + * + * At boot time, we start a ktuxonice thread that handles communication with + * other nodes. This node maintains a state machine that controls our progress + * through hibernating and resuming, keeping us in step with other nodes. Nodes + * are identified by their hw address. + * + * On startup, the node sends CLUSTER_PING on the configured interface's + * broadcast address, port $toi_cluster_port (see below) and begins to listen + * for other broadcast messages. CLUSTER_PING messages are repeated at + * intervals of 5 minutes, with a random offset to spread traffic out. + * + * A hibernation cycle is initiated from any node via + * + * echo > /sys/power/tuxonice/do_hibernate + * + * and (possibily) the hibernate script. At each step of the process, the node + * completes its work, and waits for all other nodes to signal completion of + * their work (or timeout) before progressing to the next step. + * + * Request/state Action before reply Possible reply Next state + * HIBERNATE capable, pre-script HIBERNATE|ACK NODE_PREP + * HIBERNATE|NACK INIT_0 + * + * PREP prepare_image PREP|ACK IMAGE_WRITE + * PREP|NACK INIT_0 + * ABORT RUNNING + * + * IO write image IO|ACK power off + * ABORT POST_RESUME + * + * (Boot time) check for image IMAGE|ACK RESUME_PREP + * (Note 1) + * IMAGE|NACK (Note 2) + * + * PREP prepare read image PREP|ACK IMAGE_READ + * PREP|NACK (As NACK_IMAGE) + * + * IO read image IO|ACK POST_RESUME + * + * POST_RESUME thaw, post-script RUNNING + * + * INIT_0 init 0 + * + * Other messages: + * + * - PING: Request for all other live nodes to send a PONG. Used at startup to + * announce presence, when a node is suspected dead and periodically, in case + * segments of the network are [un]plugged. + * + * - PONG: Response to a PING. + * + * - ABORT: Request to cancel writing an image. + * + * - BYE: Notification that this node is shutting down. + * + * Note 1: Repeated at 3s intervals until we continue to boot/resume, so that + * nodes which are slower to start up can get state synchronised. If a node + * starting up sees other nodes sending RESUME_PREP or IMAGE_READ, it may send + * ACK_IMAGE and they will wait for it to catch up. If it sees ACK_READ, it + * must invalidate its image (if any) and boot normally. + * + * Note 2: May occur when one node lost power or powered off while others + * hibernated. This node waits for others to complete resuming (ACK_READ) + * before completing its boot, so that it appears as a fail node restarting. + * + * If any node has an image, then it also has a list of nodes that hibernated + * in synchronisation with it. The node will wait for other nodes to appear + * or timeout before beginning its restoration. + * + * If a node has no image, it needs to wait, in case other nodes which do have + * an image are going to resume, but are taking longer to announce their + * presence. For this reason, the user can specify a timeout value and a number + * of nodes detected before we just continue. (We might want to assume in a + * cluster of, say, 15 nodes, if 8 others have booted without finding an image, + * the remaining nodes will too. This might help in situations where some nodes + * are much slower to boot, or more subject to hardware failures or such like). + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "tuxonice.h" +#include "tuxonice_modules.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_alloc.h" +#include "tuxonice_io.h" + +#if 1 +#define PRINTK(a, b...) do { printk(a, ##b); } while (0) +#else +#define PRINTK(a, b...) do { } while (0) +#endif + +static int loopback_mode; +static int num_local_nodes = 1; +#define MAX_LOCAL_NODES 8 +#define SADDR (loopback_mode ? b->sid : h->saddr) + +#define MYNAME "TuxOnIce Clustering" + +enum cluster_message { + MSG_ACK = 1, + MSG_NACK = 2, + MSG_PING = 4, + MSG_ABORT = 8, + MSG_BYE = 16, + MSG_HIBERNATE = 32, + MSG_IMAGE = 64, + MSG_IO = 128, + MSG_RUNNING = 256 +}; + +static char *str_message(int message) +{ + switch (message) { + case 4: + return "Ping"; + case 8: + return "Abort"; + case 9: + return "Abort acked"; + case 10: + return "Abort nacked"; + case 16: + return "Bye"; + case 17: + return "Bye acked"; + case 18: + return "Bye nacked"; + case 32: + return "Hibernate request"; + case 33: + return "Hibernate ack"; + case 34: + return "Hibernate nack"; + case 64: + return "Image exists?"; + case 65: + return "Image does exist"; + case 66: + return "No image here"; + case 128: + return "I/O"; + case 129: + return "I/O okay"; + case 130: + return "I/O failed"; + case 256: + return "Running"; + default: + printk(KERN_ERR "Unrecognised message %d.\n", message); + return "Unrecognised message (see dmesg)"; + } +} + +#define MSG_ACK_MASK (MSG_ACK | MSG_NACK) +#define MSG_STATE_MASK (~MSG_ACK_MASK) + +struct node_info { + struct list_head member_list; + wait_queue_head_t member_events; + spinlock_t member_list_lock; + spinlock_t receive_lock; + int peer_count, ignored_peer_count; + struct toi_sysfs_data sysfs_data; + enum cluster_message current_message; +}; + +struct node_info node_array[MAX_LOCAL_NODES]; + +struct cluster_member { + __be32 addr; + enum cluster_message message; + struct list_head list; + int ignore; +}; + +#define toi_cluster_port_send 3501 +#define toi_cluster_port_recv 3502 + +static struct net_device *net_dev; +static struct toi_module_ops toi_cluster_ops; + +static int toi_recv(struct sk_buff *skb, struct net_device *dev, + struct packet_type *pt, struct net_device *orig_dev); + +static struct packet_type toi_cluster_packet_type = { + .type = __constant_htons(ETH_P_IP), + .func = toi_recv, +}; + +struct toi_pkt { /* BOOTP packet format */ + struct iphdr iph; /* IP header */ + struct udphdr udph; /* UDP header */ + u8 htype; /* HW address type */ + u8 hlen; /* HW address length */ + __be32 xid; /* Transaction ID */ + __be16 secs; /* Seconds since we started */ + __be16 flags; /* Just what it says */ + u8 hw_addr[16]; /* Sender's HW address */ + u16 message; /* Message */ + unsigned long sid; /* Source ID for loopback testing */ +}; + +static char toi_cluster_iface[IFNAMSIZ] = CONFIG_TOI_DEFAULT_CLUSTER_INTERFACE; + +static int added_pack; + +static int others_have_image; + +/* Key used to allow multiple clusters on the same lan */ +static char toi_cluster_key[32] = CONFIG_TOI_DEFAULT_CLUSTER_KEY; +static char pre_hibernate_script[255] = + CONFIG_TOI_DEFAULT_CLUSTER_PRE_HIBERNATE; +static char post_hibernate_script[255] = + CONFIG_TOI_DEFAULT_CLUSTER_POST_HIBERNATE; + +/* List of cluster members */ +static unsigned long continue_delay = 5 * HZ; +static unsigned long cluster_message_timeout = 3 * HZ; + +/* === Membership list === */ + +static void print_member_info(int index) +{ + struct cluster_member *this; + + printk(KERN_INFO "==> Dumping node %d.\n", index); + + list_for_each_entry(this, &node_array[index].member_list, list) + printk(KERN_INFO "%d.%d.%d.%d last message %s. %s\n", + NIPQUAD(this->addr), + str_message(this->message), + this->ignore ? "(Ignored)" : ""); + printk(KERN_INFO "== Done ==\n"); +} + +static struct cluster_member *__find_member(int index, __be32 addr) +{ + struct cluster_member *this; + + list_for_each_entry(this, &node_array[index].member_list, list) { + if (this->addr != addr) + continue; + + return this; + } + + return NULL; +} + +static void set_ignore(int index, __be32 addr, struct cluster_member *this) +{ + if (this->ignore) { + PRINTK("Node %d already ignoring %d.%d.%d.%d.\n", + index, NIPQUAD(addr)); + return; + } + + PRINTK("Node %d sees node %d.%d.%d.%d now being ignored.\n", + index, NIPQUAD(addr)); + this->ignore = 1; + node_array[index].ignored_peer_count++; +} + +static int __add_update_member(int index, __be32 addr, int message) +{ + struct cluster_member *this; + + this = __find_member(index, addr); + if (this) { + if (this->message != message) { + this->message = message; + if ((message & MSG_NACK) && + (message & (MSG_HIBERNATE | MSG_IMAGE | MSG_IO))) + set_ignore(index, addr, this); + PRINTK("Node %d sees node %d.%d.%d.%d now sending " + "%s.\n", index, NIPQUAD(addr), + str_message(message)); + wake_up(&node_array[index].member_events); + } + return 0; + } + + this = (struct cluster_member *) toi_kzalloc(36, + sizeof(struct cluster_member), GFP_KERNEL); + + if (!this) + return -1; + + this->addr = addr; + this->message = message; + this->ignore = 0; + INIT_LIST_HEAD(&this->list); + + node_array[index].peer_count++; + + PRINTK("Node %d sees node %d.%d.%d.%d sending %s.\n", index, + NIPQUAD(addr), str_message(message)); + + if ((message & MSG_NACK) && + (message & (MSG_HIBERNATE | MSG_IMAGE | MSG_IO))) + set_ignore(index, addr, this); + list_add_tail(&this->list, &node_array[index].member_list); + return 1; +} + +static int add_update_member(int index, __be32 addr, int message) +{ + int result; + unsigned long flags; + spin_lock_irqsave(&node_array[index].member_list_lock, flags); + result = __add_update_member(index, addr, message); + spin_unlock_irqrestore(&node_array[index].member_list_lock, flags); + + print_member_info(index); + + wake_up(&node_array[index].member_events); + + return result; +} + +static void del_member(int index, __be32 addr) +{ + struct cluster_member *this; + unsigned long flags; + + spin_lock_irqsave(&node_array[index].member_list_lock, flags); + this = __find_member(index, addr); + + if (this) { + list_del_init(&this->list); + toi_kfree(36, this, sizeof(*this)); + node_array[index].peer_count--; + } + + spin_unlock_irqrestore(&node_array[index].member_list_lock, flags); +} + +/* === Message transmission === */ + +static void toi_send_if(int message, unsigned long my_id); + +/* + * Process received TOI packet. + */ +static int toi_recv(struct sk_buff *skb, struct net_device *dev, + struct packet_type *pt, struct net_device *orig_dev) +{ + struct toi_pkt *b; + struct iphdr *h; + int len, result, index; + unsigned long addr, message, ack; + + /* Perform verifications before taking the lock. */ + if (skb->pkt_type == PACKET_OTHERHOST) + goto drop; + + if (dev != net_dev) + goto drop; + + skb = skb_share_check(skb, GFP_ATOMIC); + if (!skb) + return NET_RX_DROP; + + if (!pskb_may_pull(skb, + sizeof(struct iphdr) + + sizeof(struct udphdr))) + goto drop; + + b = (struct toi_pkt *)skb_network_header(skb); + h = &b->iph; + + if (h->ihl != 5 || h->version != 4 || h->protocol != IPPROTO_UDP) + goto drop; + + /* Fragments are not supported */ + if (h->frag_off & htons(IP_OFFSET | IP_MF)) { + if (net_ratelimit()) + printk(KERN_ERR "TuxOnIce: Ignoring fragmented " + "cluster message.\n"); + goto drop; + } + + if (skb->len < ntohs(h->tot_len)) + goto drop; + + if (ip_fast_csum((char *) h, h->ihl)) + goto drop; + + if (b->udph.source != htons(toi_cluster_port_send) || + b->udph.dest != htons(toi_cluster_port_recv)) + goto drop; + + if (ntohs(h->tot_len) < ntohs(b->udph.len) + sizeof(struct iphdr)) + goto drop; + + len = ntohs(b->udph.len) - sizeof(struct udphdr); + + /* Ok the front looks good, make sure we can get at the rest. */ + if (!pskb_may_pull(skb, skb->len)) + goto drop; + + b = (struct toi_pkt *)skb_network_header(skb); + h = &b->iph; + + addr = SADDR; + PRINTK(">>> Message %s received from " NIPQUAD_FMT ".\n", + str_message(b->message), NIPQUAD(addr)); + + message = b->message & MSG_STATE_MASK; + ack = b->message & MSG_ACK_MASK; + + for (index = 0; index < num_local_nodes; index++) { + int new_message = node_array[index].current_message, + old_message = new_message; + + if (index == SADDR || !old_message) { + PRINTK("Ignoring node %d (offline or self).\n", index); + continue; + } + + /* One message at a time, please. */ + spin_lock(&node_array[index].receive_lock); + + result = add_update_member(index, SADDR, b->message); + if (result == -1) { + printk(KERN_INFO "Failed to add new cluster member " + NIPQUAD_FMT ".\n", + NIPQUAD(addr)); + goto drop_unlock; + } + + switch (b->message & MSG_STATE_MASK) { + case MSG_PING: + break; + case MSG_ABORT: + break; + case MSG_BYE: + break; + case MSG_HIBERNATE: + /* Can I hibernate? */ + new_message = MSG_HIBERNATE | + ((index & 1) ? MSG_NACK : MSG_ACK); + break; + case MSG_IMAGE: + /* Can I resume? */ + new_message = MSG_IMAGE | + ((index & 1) ? MSG_NACK : MSG_ACK); + if (new_message != old_message) + printk(KERN_ERR "Setting whether I can resume " + "to %d.\n", new_message); + break; + case MSG_IO: + new_message = MSG_IO | MSG_ACK; + break; + case MSG_RUNNING: + break; + default: + if (net_ratelimit()) + printk(KERN_ERR "Unrecognised TuxOnIce cluster" + " message %d from " NIPQUAD_FMT ".\n", + b->message, NIPQUAD(addr)); + }; + + if (old_message != new_message) { + node_array[index].current_message = new_message; + printk(KERN_INFO ">>> Sending new message for node " + "%d.\n", index); + toi_send_if(new_message, index); + } else if (!ack) { + printk(KERN_INFO ">>> Resending message for node %d.\n", + index); + toi_send_if(new_message, index); + } +drop_unlock: + spin_unlock(&node_array[index].receive_lock); + }; + +drop: + /* Throw the packet out. */ + kfree_skb(skb); + + return 0; +} + +/* + * Send cluster message to single interface. + */ +static void toi_send_if(int message, unsigned long my_id) +{ + struct sk_buff *skb; + struct toi_pkt *b; + int hh_len = LL_RESERVED_SPACE(net_dev); + struct iphdr *h; + + /* Allocate packet */ + skb = alloc_skb(sizeof(struct toi_pkt) + hh_len + 15, GFP_KERNEL); + if (!skb) + return; + skb_reserve(skb, hh_len); + b = (struct toi_pkt *) skb_put(skb, sizeof(struct toi_pkt)); + memset(b, 0, sizeof(struct toi_pkt)); + + /* Construct IP header */ + skb_reset_network_header(skb); + h = ip_hdr(skb); + h->version = 4; + h->ihl = 5; + h->tot_len = htons(sizeof(struct toi_pkt)); + h->frag_off = htons(IP_DF); + h->ttl = 64; + h->protocol = IPPROTO_UDP; + h->daddr = htonl(INADDR_BROADCAST); + h->check = ip_fast_csum((unsigned char *) h, h->ihl); + + /* Construct UDP header */ + b->udph.source = htons(toi_cluster_port_send); + b->udph.dest = htons(toi_cluster_port_recv); + b->udph.len = htons(sizeof(struct toi_pkt) - sizeof(struct iphdr)); + /* UDP checksum not calculated -- explicitly allowed in BOOTP RFC */ + + /* Construct message */ + b->message = message; + b->sid = my_id; + b->htype = net_dev->type; /* can cause undefined behavior */ + b->hlen = net_dev->addr_len; + memcpy(b->hw_addr, net_dev->dev_addr, net_dev->addr_len); + b->secs = htons(3); /* 3 seconds */ + + /* Chain packet down the line... */ + skb->dev = net_dev; + skb->protocol = htons(ETH_P_IP); + if ((dev_hard_header(skb, net_dev, ntohs(skb->protocol), + net_dev->broadcast, net_dev->dev_addr, skb->len) < 0) || + dev_queue_xmit(skb) < 0) + printk(KERN_INFO "E"); +} + +/* ========================================= */ + +/* kTOICluster */ + +static atomic_t num_cluster_threads; +static DECLARE_WAIT_QUEUE_HEAD(clusterd_events); + +static int kTOICluster(void *data) +{ + unsigned long my_id; + + my_id = atomic_add_return(1, &num_cluster_threads) - 1; + node_array[my_id].current_message = (unsigned long) data; + + PRINTK("kTOICluster daemon %lu starting.\n", my_id); + + current->flags |= PF_NOFREEZE; + + while (node_array[my_id].current_message) { + toi_send_if(node_array[my_id].current_message, my_id); + sleep_on_timeout(&clusterd_events, + cluster_message_timeout); + PRINTK("Link state %lu is %d.\n", my_id, + node_array[my_id].current_message); + } + + toi_send_if(MSG_BYE, my_id); + atomic_dec(&num_cluster_threads); + wake_up(&clusterd_events); + + PRINTK("kTOICluster daemon %lu exiting.\n", my_id); + __set_current_state(TASK_RUNNING); + return 0; +} + +static void kill_clusterd(void) +{ + int i; + + for (i = 0; i < num_local_nodes; i++) { + if (node_array[i].current_message) { + PRINTK("Seeking to kill clusterd %d.\n", i); + node_array[i].current_message = 0; + } + } + wait_event(clusterd_events, + !atomic_read(&num_cluster_threads)); + PRINTK("All cluster daemons have exited.\n"); +} + +static int peers_not_in_message(int index, int message, int precise) +{ + struct cluster_member *this; + unsigned long flags; + int result = 0; + + spin_lock_irqsave(&node_array[index].member_list_lock, flags); + list_for_each_entry(this, &node_array[index].member_list, list) { + if (this->ignore) + continue; + + PRINTK("Peer %d.%d.%d.%d sending %s. " + "Seeking %s.\n", + NIPQUAD(this->addr), + str_message(this->message), str_message(message)); + if ((precise ? this->message : + this->message & MSG_STATE_MASK) != + message) + result++; + } + spin_unlock_irqrestore(&node_array[index].member_list_lock, flags); + PRINTK("%d peers in sought message.\n", result); + return result; +} + +static void reset_ignored(int index) +{ + struct cluster_member *this; + unsigned long flags; + + spin_lock_irqsave(&node_array[index].member_list_lock, flags); + list_for_each_entry(this, &node_array[index].member_list, list) + this->ignore = 0; + node_array[index].ignored_peer_count = 0; + spin_unlock_irqrestore(&node_array[index].member_list_lock, flags); +} + +static int peers_in_message(int index, int message, int precise) +{ + return node_array[index].peer_count - + node_array[index].ignored_peer_count - + peers_not_in_message(index, message, precise); +} + +static int time_to_continue(int index, unsigned long start, int message) +{ + int first = peers_not_in_message(index, message, 0); + int second = peers_in_message(index, message, 1); + + PRINTK("First part returns %d, second returns %d.\n", first, second); + + if (!first && !second) { + PRINTK("All peers answered message %d.\n", + message); + return 1; + } + + if (time_after(jiffies, start + continue_delay)) { + PRINTK("Timeout reached.\n"); + return 1; + } + + PRINTK("Not time to continue yet (%lu < %lu).\n", jiffies, + start + continue_delay); + return 0; +} + +void toi_initiate_cluster_hibernate(void) +{ + int result; + unsigned long start; + + result = do_toi_step(STEP_HIBERNATE_PREPARE_IMAGE); + if (result) + return; + + toi_send_if(MSG_HIBERNATE, 0); + + start = jiffies; + wait_event(node_array[0].member_events, + time_to_continue(0, start, MSG_HIBERNATE)); + + if (test_action_state(TOI_FREEZER_TEST)) { + toi_send_if(MSG_ABORT, 0); + + start = jiffies; + wait_event(node_array[0].member_events, + time_to_continue(0, start, MSG_RUNNING)); + + do_toi_step(STEP_QUIET_CLEANUP); + return; + } + + toi_send_if(MSG_IO, 0); + + result = do_toi_step(STEP_HIBERNATE_SAVE_IMAGE); + if (result) + return; + + /* This code runs at resume time too! */ + if (toi_in_hibernate) + result = do_toi_step(STEP_HIBERNATE_POWERDOWN); +} + +/* toi_cluster_print_debug_stats + * + * Description: Print information to be recorded for debugging purposes into a + * buffer. + * Arguments: buffer: Pointer to a buffer into which the debug info will be + * printed. + * size: Size of the buffer. + * Returns: Number of characters written to the buffer. + */ +static int toi_cluster_print_debug_stats(char *buffer, int size) +{ + int len; + + if (strlen(toi_cluster_iface)) + len = scnprintf(buffer, size, + "- Cluster interface is '%s'.\n", + toi_cluster_iface); + else + len = scnprintf(buffer, size, + "- Cluster support is disabled.\n"); + return len; +} + +/* cluster_memory_needed + * + * Description: Tell the caller how much memory we need to operate during + * hibernate/resume. + * Returns: Unsigned long. Maximum number of bytes of memory required for + * operation. + */ +static int toi_cluster_memory_needed(void) +{ + return 0; +} + +static int toi_cluster_storage_needed(void) +{ + return 1 + strlen(toi_cluster_iface); +} + +/* toi_cluster_save_config_info + * + * Description: Save informaton needed when reloading the image at resume time. + * Arguments: Buffer: Pointer to a buffer of size PAGE_SIZE. + * Returns: Number of bytes used for saving our data. + */ +static int toi_cluster_save_config_info(char *buffer) +{ + strcpy(buffer, toi_cluster_iface); + return strlen(toi_cluster_iface + 1); +} + +/* toi_cluster_load_config_info + * + * Description: Reload information needed for declustering the image at + * resume time. + * Arguments: Buffer: Pointer to the start of the data. + * Size: Number of bytes that were saved. + */ +static void toi_cluster_load_config_info(char *buffer, int size) +{ + strncpy(toi_cluster_iface, buffer, size); + return; +} + +static void cluster_startup(void) +{ + int have_image = do_check_can_resume(), i; + unsigned long start = jiffies, initial_message; + struct task_struct *p; + + initial_message = MSG_IMAGE; + + have_image = 1; + + for (i = 0; i < num_local_nodes; i++) { + PRINTK("Starting ktoiclusterd %d.\n", i); + p = kthread_create(kTOICluster, (void *) initial_message, + "ktoiclusterd/%d", i); + if (IS_ERR(p)) { + printk(KERN_ERR "Failed to start ktoiclusterd.\n"); + return; + } + + wake_up_process(p); + } + + /* Wait for delay or someone else sending first message */ + wait_event(node_array[0].member_events, time_to_continue(0, start, + MSG_IMAGE)); + + others_have_image = peers_in_message(0, MSG_IMAGE | MSG_ACK, 1); + + printk(KERN_INFO "Continuing. I %shave an image. Peers with image:" + " %d.\n", have_image ? "" : "don't ", others_have_image); + + if (have_image) { + int result; + + /* Start to resume */ + printk(KERN_INFO " === Starting to resume === \n"); + node_array[0].current_message = MSG_IO; + toi_send_if(MSG_IO, 0); + + /* result = do_toi_step(STEP_RESUME_LOAD_PS1); */ + result = 0; + + if (!result) { + /* + * Atomic restore - we'll come back in the hibernation + * path. + */ + + /* result = do_toi_step(STEP_RESUME_DO_RESTORE); */ + result = 0; + + /* do_toi_step(STEP_QUIET_CLEANUP); */ + } + + node_array[0].current_message |= MSG_NACK; + + /* For debugging - disable for real life? */ + wait_event(node_array[0].member_events, + time_to_continue(0, start, MSG_IO)); + } + + if (others_have_image) { + /* Wait for them to resume */ + printk(KERN_INFO "Waiting for other nodes to resume.\n"); + start = jiffies; + wait_event(node_array[0].member_events, + time_to_continue(0, start, MSG_RUNNING)); + if (peers_not_in_message(0, MSG_RUNNING, 0)) + printk(KERN_INFO "Timed out while waiting for other " + "nodes to resume.\n"); + } + + /* Find out whether an image exists here. Send ACK_IMAGE or NACK_IMAGE + * as appropriate. + * + * If we don't have an image: + * - Wait until someone else says they have one, or conditions are met + * for continuing to boot (n machines or t seconds). + * - If anyone has an image, wait for them to resume before continuing + * to boot. + * + * If we have an image: + * - Wait until conditions are met before continuing to resume (n + * machines or t seconds). Send RESUME_PREP and freeze processes. + * NACK_PREP if freezing fails (shouldn't) and follow logic for + * us having no image above. On success, wait for [N]ACK_PREP from + * other machines. Read image (including atomic restore) until done. + * Wait for ACK_READ from others (should never fail). Thaw processes + * and do post-resume. (The section after the atomic restore is done + * via the code for hibernating). + */ + + node_array[0].current_message = MSG_RUNNING; +} + +/* toi_cluster_open_iface + * + * Description: Prepare to use an interface. + */ + +static int toi_cluster_open_iface(void) +{ + struct net_device *dev; + + rtnl_lock(); + + for_each_netdev(&init_net, dev) { + if (/* dev == &init_net.loopback_dev || */ + strcmp(dev->name, toi_cluster_iface)) + continue; + + net_dev = dev; + break; + } + + rtnl_unlock(); + + if (!net_dev) { + printk(KERN_ERR MYNAME ": Device %s not found.\n", + toi_cluster_iface); + return -ENODEV; + } + + dev_add_pack(&toi_cluster_packet_type); + added_pack = 1; + + loopback_mode = (net_dev == init_net.loopback_dev); + num_local_nodes = loopback_mode ? 8 : 1; + + PRINTK("Loopback mode is %s. Number of local nodes is %d.\n", + loopback_mode ? "on" : "off", num_local_nodes); + + cluster_startup(); + return 0; +} + +/* toi_cluster_close_iface + * + * Description: Stop using an interface. + */ + +static int toi_cluster_close_iface(void) +{ + kill_clusterd(); + if (added_pack) { + dev_remove_pack(&toi_cluster_packet_type); + added_pack = 0; + } + return 0; +} + +static void write_side_effect(void) +{ + if (toi_cluster_ops.enabled) { + toi_cluster_open_iface(); + set_toi_state(TOI_CLUSTER_MODE); + } else { + toi_cluster_close_iface(); + clear_toi_state(TOI_CLUSTER_MODE); + } +} + +static void node_write_side_effect(void) +{ +} + +/* + * data for our sysfs entries. + */ +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_STRING("interface", SYSFS_RW, toi_cluster_iface, IFNAMSIZ, 0, + NULL), + SYSFS_INT("enabled", SYSFS_RW, &toi_cluster_ops.enabled, 0, 1, 0, + write_side_effect), + SYSFS_STRING("cluster_name", SYSFS_RW, toi_cluster_key, 32, 0, NULL), + SYSFS_STRING("pre-hibernate-script", SYSFS_RW, pre_hibernate_script, + 256, 0, NULL), + SYSFS_STRING("post-hibernate-script", SYSFS_RW, post_hibernate_script, + 256, 0, STRING), + SYSFS_UL("continue_delay", SYSFS_RW, &continue_delay, HZ / 2, 60 * HZ, + 0) +}; + +/* + * Ops structure. + */ + +static struct toi_module_ops toi_cluster_ops = { + .type = FILTER_MODULE, + .name = "Cluster", + .directory = "cluster", + .module = THIS_MODULE, + .memory_needed = toi_cluster_memory_needed, + .print_debug_info = toi_cluster_print_debug_stats, + .save_config_info = toi_cluster_save_config_info, + .load_config_info = toi_cluster_load_config_info, + .storage_needed = toi_cluster_storage_needed, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +/* ---- Registration ---- */ + +#ifdef MODULE +#define INIT static __init +#define EXIT static __exit +#else +#define INIT +#define EXIT +#endif + +INIT int toi_cluster_init(void) +{ + int temp = toi_register_module(&toi_cluster_ops), i; + struct kobject *kobj = toi_cluster_ops.dir_kobj; + + for (i = 0; i < MAX_LOCAL_NODES; i++) { + node_array[i].current_message = 0; + INIT_LIST_HEAD(&node_array[i].member_list); + init_waitqueue_head(&node_array[i].member_events); + spin_lock_init(&node_array[i].member_list_lock); + spin_lock_init(&node_array[i].receive_lock); + + /* Set up sysfs entry */ + node_array[i].sysfs_data.attr.name = toi_kzalloc(8, + sizeof(node_array[i].sysfs_data.attr.name), + GFP_KERNEL); + sprintf((char *) node_array[i].sysfs_data.attr.name, "node_%d", + i); + node_array[i].sysfs_data.attr.mode = SYSFS_RW; + node_array[i].sysfs_data.type = TOI_SYSFS_DATA_INTEGER; + node_array[i].sysfs_data.flags = 0; + node_array[i].sysfs_data.data.integer.variable = + (int *) &node_array[i].current_message; + node_array[i].sysfs_data.data.integer.minimum = 0; + node_array[i].sysfs_data.data.integer.maximum = INT_MAX; + node_array[i].sysfs_data.write_side_effect = + node_write_side_effect; + toi_register_sysfs_file(kobj, &node_array[i].sysfs_data); + } + + toi_cluster_ops.enabled = (strlen(toi_cluster_iface) > 0); + + if (toi_cluster_ops.enabled) + toi_cluster_open_iface(); + + return temp; +} + +EXIT void toi_cluster_exit(void) +{ + int i; + toi_cluster_close_iface(); + + for (i = 0; i < MAX_LOCAL_NODES; i++) + toi_unregister_sysfs_file(toi_cluster_ops.dir_kobj, + &node_array[i].sysfs_data); + toi_unregister_module(&toi_cluster_ops); +} + +static int __init toi_cluster_iface_setup(char *iface) +{ + toi_cluster_ops.enabled = (*iface && + strcmp(iface, "off")); + + if (toi_cluster_ops.enabled) + strncpy(toi_cluster_iface, iface, strlen(iface)); +} + +__setup("toi_cluster=", toi_cluster_iface_setup); diff --git b/kernel/power/tuxonice_cluster.h b/kernel/power/tuxonice_cluster.h new file mode 100644 index 0000000..84356b3 --- /dev/null +++ b/kernel/power/tuxonice_cluster.h @@ -0,0 +1,18 @@ +/* + * kernel/power/tuxonice_cluster.h + * + * Copyright (C) 2006-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + */ + +#ifdef CONFIG_TOI_CLUSTER +extern int toi_cluster_init(void); +extern void toi_cluster_exit(void); +extern void toi_initiate_cluster_hibernate(void); +#else +static inline int toi_cluster_init(void) { return 0; } +static inline void toi_cluster_exit(void) { } +static inline void toi_initiate_cluster_hibernate(void) { } +#endif + diff --git b/kernel/power/tuxonice_compress.c b/kernel/power/tuxonice_compress.c new file mode 100644 index 0000000..84b8522 --- /dev/null +++ b/kernel/power/tuxonice_compress.c @@ -0,0 +1,452 @@ +/* + * kernel/power/compression.c + * + * Copyright (C) 2003-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * This file contains data compression routines for TuxOnIce, + * using cryptoapi. + */ + +#include +#include +#include +#include + +#include "tuxonice_builtin.h" +#include "tuxonice.h" +#include "tuxonice_modules.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_io.h" +#include "tuxonice_ui.h" +#include "tuxonice_alloc.h" + +static int toi_expected_compression; + +static struct toi_module_ops toi_compression_ops; +static struct toi_module_ops *next_driver; + +static char toi_compressor_name[32] = "lzo"; + +static DEFINE_MUTEX(stats_lock); + +struct cpu_context { + u8 *page_buffer; + struct crypto_comp *transform; + unsigned int len; + u8 *buffer_start; + u8 *output_buffer; +}; + +#define OUT_BUF_SIZE (2 * PAGE_SIZE) + +static DEFINE_PER_CPU(struct cpu_context, contexts); + +/* + * toi_crypto_prepare + * + * Prepare to do some work by allocating buffers and transforms. + */ +static int toi_compress_crypto_prepare(void) +{ + int cpu; + + if (!*toi_compressor_name) { + printk(KERN_INFO "TuxOnIce: Compression enabled but no " + "compressor name set.\n"); + return 1; + } + + for_each_online_cpu(cpu) { + struct cpu_context *this = &per_cpu(contexts, cpu); + this->transform = crypto_alloc_comp(toi_compressor_name, 0, 0); + if (IS_ERR(this->transform)) { + printk(KERN_INFO "TuxOnIce: Failed to initialise the " + "%s compression transform.\n", + toi_compressor_name); + this->transform = NULL; + return 1; + } + + this->page_buffer = + (char *) toi_get_zeroed_page(16, TOI_ATOMIC_GFP); + + if (!this->page_buffer) { + printk(KERN_ERR + "Failed to allocate a page buffer for TuxOnIce " + "compression driver.\n"); + return -ENOMEM; + } + + this->output_buffer = + (char *) vmalloc_32(OUT_BUF_SIZE); + + if (!this->output_buffer) { + printk(KERN_ERR + "Failed to allocate a output buffer for TuxOnIce " + "compression driver.\n"); + return -ENOMEM; + } + } + + return 0; +} + +static int toi_compress_rw_cleanup(int writing) +{ + int cpu; + + for_each_online_cpu(cpu) { + struct cpu_context *this = &per_cpu(contexts, cpu); + if (this->transform) { + crypto_free_comp(this->transform); + this->transform = NULL; + } + + if (this->page_buffer) + toi_free_page(16, (unsigned long) this->page_buffer); + + this->page_buffer = NULL; + + if (this->output_buffer) + vfree(this->output_buffer); + + this->output_buffer = NULL; + } + + return 0; +} + +/* + * toi_compress_init + */ + +static int toi_compress_init(int toi_or_resume) +{ + if (!toi_or_resume) + return 0; + + toi_compress_bytes_in = 0; + toi_compress_bytes_out = 0; + + next_driver = toi_get_next_filter(&toi_compression_ops); + + return next_driver ? 0 : -ECHILD; +} + +/* + * toi_compress_rw_init() + */ + +static int toi_compress_rw_init(int rw, int stream_number) +{ + if (toi_compress_crypto_prepare()) { + printk(KERN_ERR "Failed to initialise compression " + "algorithm.\n"); + if (rw == READ) { + printk(KERN_INFO "Unable to read the image.\n"); + return -ENODEV; + } else { + printk(KERN_INFO "Continuing without " + "compressing the image.\n"); + toi_compression_ops.enabled = 0; + } + } + + return 0; +} + +/* + * toi_compress_write_page() + * + * Compress a page of data, buffering output and passing on filled + * pages to the next module in the pipeline. + * + * Buffer_page: Pointer to a buffer of size PAGE_SIZE, containing + * data to be compressed. + * + * Returns: 0 on success. Otherwise the error is that returned by later + * modules, -ECHILD if we have a broken pipeline or -EIO if + * zlib errs. + */ +static int toi_compress_write_page(unsigned long index, int buf_type, + void *buffer_page, unsigned int buf_size) +{ + int ret = 0, cpu = smp_processor_id(); + struct cpu_context *ctx = &per_cpu(contexts, cpu); + u8* output_buffer = buffer_page; + int output_len = buf_size; + int out_buf_type = buf_type; + + if (ctx->transform) { + + ctx->buffer_start = TOI_MAP(buf_type, buffer_page); + ctx->len = OUT_BUF_SIZE; + + ret = crypto_comp_compress(ctx->transform, + ctx->buffer_start, buf_size, + ctx->output_buffer, &ctx->len); + + TOI_UNMAP(buf_type, buffer_page); + + toi_message(TOI_COMPRESS, TOI_VERBOSE, 0, + "CPU %d, index %lu: %d bytes", + cpu, index, ctx->len); + + if (!ret && ctx->len < buf_size) { /* some compression */ + output_buffer = ctx->output_buffer; + output_len = ctx->len; + out_buf_type = TOI_VIRT; + } + + } + + mutex_lock(&stats_lock); + + toi_compress_bytes_in += buf_size; + toi_compress_bytes_out += output_len; + + mutex_unlock(&stats_lock); + + if (!ret) + ret = next_driver->write_page(index, out_buf_type, + output_buffer, output_len); + + return ret; +} + +/* + * toi_compress_read_page() + * @buffer_page: struct page *. Pointer to a buffer of size PAGE_SIZE. + * + * Retrieve data from later modules and decompress it until the input buffer + * is filled. + * Zero if successful. Error condition from me or from downstream on failure. + */ +static int toi_compress_read_page(unsigned long *index, int buf_type, + void *buffer_page, unsigned int *buf_size) +{ + int ret, cpu = smp_processor_id(); + unsigned int len; + unsigned int outlen = PAGE_SIZE; + char *buffer_start; + struct cpu_context *ctx = &per_cpu(contexts, cpu); + + if (!ctx->transform) + return next_driver->read_page(index, TOI_PAGE, buffer_page, + buf_size); + + /* + * All our reads must be synchronous - we can't decompress + * data that hasn't been read yet. + */ + + ret = next_driver->read_page(index, TOI_VIRT, ctx->page_buffer, &len); + + buffer_start = kmap(buffer_page); + + /* Error or uncompressed data */ + if (ret || len == PAGE_SIZE) { + memcpy(buffer_start, ctx->page_buffer, len); + goto out; + } + + ret = crypto_comp_decompress( + ctx->transform, + ctx->page_buffer, + len, buffer_start, &outlen); + + toi_message(TOI_COMPRESS, TOI_VERBOSE, 0, + "CPU %d, index %lu: %d=>%d (%d).", + cpu, *index, len, outlen, ret); + + if (ret) + abort_hibernate(TOI_FAILED_IO, + "Compress_read returned %d.\n", ret); + else if (outlen != PAGE_SIZE) { + abort_hibernate(TOI_FAILED_IO, + "Decompression yielded %d bytes instead of %ld.\n", + outlen, PAGE_SIZE); + printk(KERN_ERR "Decompression yielded %d bytes instead of " + "%ld.\n", outlen, PAGE_SIZE); + ret = -EIO; + *buf_size = outlen; + } +out: + TOI_UNMAP(buf_type, buffer_page); + return ret; +} + +/* + * toi_compress_print_debug_stats + * @buffer: Pointer to a buffer into which the debug info will be printed. + * @size: Size of the buffer. + * + * Print information to be recorded for debugging purposes into a buffer. + * Returns: Number of characters written to the buffer. + */ + +static int toi_compress_print_debug_stats(char *buffer, int size) +{ + unsigned long pages_in = toi_compress_bytes_in >> PAGE_SHIFT, + pages_out = toi_compress_bytes_out >> PAGE_SHIFT; + int len; + + /* Output the compression ratio achieved. */ + if (*toi_compressor_name) + len = scnprintf(buffer, size, "- Compressor is '%s'.\n", + toi_compressor_name); + else + len = scnprintf(buffer, size, "- Compressor is not set.\n"); + + if (pages_in) + len += scnprintf(buffer+len, size - len, " Compressed " + "%lu bytes into %lu (%ld percent compression).\n", + toi_compress_bytes_in, + toi_compress_bytes_out, + (pages_in - pages_out) * 100 / pages_in); + return len; +} + +/* + * toi_compress_compression_memory_needed + * + * Tell the caller how much memory we need to operate during hibernate/resume. + * Returns: Unsigned long. Maximum number of bytes of memory required for + * operation. + */ +static int toi_compress_memory_needed(void) +{ + return 2 * PAGE_SIZE; +} + +static int toi_compress_storage_needed(void) +{ + return 2 * sizeof(unsigned long) + 2 * sizeof(int) + + strlen(toi_compressor_name) + 1; +} + +/* + * toi_compress_save_config_info + * @buffer: Pointer to a buffer of size PAGE_SIZE. + * + * Save informaton needed when reloading the image at resume time. + * Returns: Number of bytes used for saving our data. + */ +static int toi_compress_save_config_info(char *buffer) +{ + int len = strlen(toi_compressor_name) + 1, offset = 0; + + *((unsigned long *) buffer) = toi_compress_bytes_in; + offset += sizeof(unsigned long); + *((unsigned long *) (buffer + offset)) = toi_compress_bytes_out; + offset += sizeof(unsigned long); + *((int *) (buffer + offset)) = toi_expected_compression; + offset += sizeof(int); + *((int *) (buffer + offset)) = len; + offset += sizeof(int); + strncpy(buffer + offset, toi_compressor_name, len); + return offset + len; +} + +/* toi_compress_load_config_info + * @buffer: Pointer to the start of the data. + * @size: Number of bytes that were saved. + * + * Description: Reload information needed for decompressing the image at + * resume time. + */ +static void toi_compress_load_config_info(char *buffer, int size) +{ + int len, offset = 0; + + toi_compress_bytes_in = *((unsigned long *) buffer); + offset += sizeof(unsigned long); + toi_compress_bytes_out = *((unsigned long *) (buffer + offset)); + offset += sizeof(unsigned long); + toi_expected_compression = *((int *) (buffer + offset)); + offset += sizeof(int); + len = *((int *) (buffer + offset)); + offset += sizeof(int); + strncpy(toi_compressor_name, buffer + offset, len); +} + +static void toi_compress_pre_atomic_restore(struct toi_boot_kernel_data *bkd) +{ + bkd->compress_bytes_in = toi_compress_bytes_in; + bkd->compress_bytes_out = toi_compress_bytes_out; +} + +static void toi_compress_post_atomic_restore(struct toi_boot_kernel_data *bkd) +{ + toi_compress_bytes_in = bkd->compress_bytes_in; + toi_compress_bytes_out = bkd->compress_bytes_out; +} + +/* + * toi_expected_compression_ratio + * + * Description: Returns the expected ratio between data passed into this module + * and the amount of data output when writing. + * Returns: 100 if the module is disabled. Otherwise the value set by the + * user via our sysfs entry. + */ + +static int toi_compress_expected_ratio(void) +{ + if (!toi_compression_ops.enabled) + return 100; + else + return 100 - toi_expected_compression; +} + +/* + * data for our sysfs entries. + */ +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_INT("expected_compression", SYSFS_RW, &toi_expected_compression, + 0, 99, 0, NULL), + SYSFS_INT("enabled", SYSFS_RW, &toi_compression_ops.enabled, 0, 1, 0, + NULL), + SYSFS_STRING("algorithm", SYSFS_RW, toi_compressor_name, 31, 0, NULL), +}; + +/* + * Ops structure. + */ +static struct toi_module_ops toi_compression_ops = { + .type = FILTER_MODULE, + .name = "compression", + .directory = "compression", + .module = THIS_MODULE, + .initialise = toi_compress_init, + .memory_needed = toi_compress_memory_needed, + .print_debug_info = toi_compress_print_debug_stats, + .save_config_info = toi_compress_save_config_info, + .load_config_info = toi_compress_load_config_info, + .storage_needed = toi_compress_storage_needed, + .expected_compression = toi_compress_expected_ratio, + + .pre_atomic_restore = toi_compress_pre_atomic_restore, + .post_atomic_restore = toi_compress_post_atomic_restore, + + .rw_init = toi_compress_rw_init, + .rw_cleanup = toi_compress_rw_cleanup, + + .write_page = toi_compress_write_page, + .read_page = toi_compress_read_page, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +/* ---- Registration ---- */ + +static __init int toi_compress_load(void) +{ + return toi_register_module(&toi_compression_ops); +} + +late_initcall(toi_compress_load); diff --git b/kernel/power/tuxonice_copy_before_write.c b/kernel/power/tuxonice_copy_before_write.c new file mode 100644 index 0000000..eb62791 --- /dev/null +++ b/kernel/power/tuxonice_copy_before_write.c @@ -0,0 +1,240 @@ +/* + * kernel/power/tuxonice_copy_before_write.c + * + * Copyright (C) 2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * Routines (apart from the fault handling code) to deal with allocating memory + * for copying pages before they are modified, restoring the contents and getting + * the contents written to disk. + */ + +#include +#include +#include +#include "tuxonice_alloc.h" +#include "tuxonice_modules.h" +#include "tuxonice_sysfs.h" +#include "tuxonice.h" + +DEFINE_PER_CPU(struct toi_cbw_state, toi_cbw_states); +#define CBWS_PER_PAGE (PAGE_SIZE / sizeof(struct toi_cbw)) +#define toi_cbw_pool_size 100 + +static void _toi_free_cbw_data(struct toi_cbw_state *state) +{ + struct toi_cbw *page_ptr, *ptr, *next; + + page_ptr = ptr = state->first; + + while(ptr) { + next = ptr->next; + + if (ptr->virt) { + toi__free_page(40, virt_to_page(ptr->virt)); + } + if ((((unsigned long) ptr) & PAGE_MASK) != (unsigned long) page_ptr) { + /* Must be on a new page - free the previous one. */ + toi__free_page(40, virt_to_page(page_ptr)); + page_ptr = ptr; + } + ptr = next; + } + + if (page_ptr) { + toi__free_page(40, virt_to_page(page_ptr)); + } + + state->first = state->next = state->last = NULL; + state->size = 0; +} + +void toi_free_cbw_data(void) +{ + int i; + + for_each_online_cpu(i) { + struct toi_cbw_state *state = &per_cpu(toi_cbw_states, i); + + if (!state->first) + continue; + + state->enabled = 0; + + while (state->active) { + schedule(); + } + + _toi_free_cbw_data(state); + } +} + +static int _toi_allocate_cbw_data(struct toi_cbw_state *state) +{ + while(state->size < toi_cbw_pool_size) { + int i; + struct toi_cbw *ptr; + + ptr = (struct toi_cbw *) toi_get_zeroed_page(40, GFP_KERNEL); + + if (!ptr) { + return -ENOMEM; + } + + if (!state->first) { + state->first = state->next = state->last = ptr; + } + + for (i = 0; i < CBWS_PER_PAGE; i++) { + struct toi_cbw *cbw = &ptr[i]; + + cbw->virt = (char *) toi_get_zeroed_page(40, GFP_KERNEL); + if (!cbw->virt) { + state->size += i; + printk("Out of memory allocating CBW pages.\n"); + return -ENOMEM; + } + + if (cbw == state->first) + continue; + + state->last->next = cbw; + state->last = cbw; + } + + state->size += CBWS_PER_PAGE; + } + + state->enabled = 1; + + return 0; +} + + +int toi_allocate_cbw_data(void) +{ + int i, result; + + for_each_online_cpu(i) { + struct toi_cbw_state *state = &per_cpu(toi_cbw_states, i); + + result = _toi_allocate_cbw_data(state); + + if (result) + return result; + } + + return 0; +} + +void toi_cbw_restore(void) +{ + if (!toi_keeping_image) + return; + +} + +void toi_cbw_write(void) +{ + if (!toi_keeping_image) + return; + +} + +/** + * toi_cbw_test_read - Test copy before write on one page + * + * Allocate copy before write buffers, then make one page only copy-before-write + * and attempt to write to it. We should then be able to retrieve the original + * version from the cbw buffer and the modified version from the page itself. + */ +static int toi_cbw_test_read(const char *buffer, int count) +{ + unsigned long virt = toi_get_zeroed_page(40, GFP_KERNEL); + char *original = "Original contents"; + char *modified = "Modified material"; + struct page *page = virt_to_page(virt); + int i, len = 0, found = 0, pfn = page_to_pfn(page); + + if (!page) { + printk("toi_cbw_test_read: Unable to allocate a page for testing.\n"); + return -ENOMEM; + } + + memcpy((char *) virt, original, strlen(original)); + + if (toi_allocate_cbw_data()) { + printk("toi_cbw_test_read: Unable to allocate cbw data.\n"); + return -ENOMEM; + } + + toi_reset_dirtiness_one(pfn, 0); + + SetPageTOI_CBW(page); + + memcpy((char *) virt, modified, strlen(modified)); + + if (strncmp((char *) virt, modified, strlen(modified))) { + len += sprintf((char *) buffer + len, "Failed to write to page after protecting it.\n"); + } + + for_each_online_cpu(i) { + struct toi_cbw_state *state = &per_cpu(toi_cbw_states, i); + struct toi_cbw *ptr = state->first, *last_ptr = ptr; + + if (!found) { + while (ptr) { + if (ptr->pfn == pfn) { + found = 1; + if (strncmp(ptr->virt, original, strlen(original))) { + len += sprintf((char *) buffer + len, "Contents of original buffer are not original.\n"); + } else { + len += sprintf((char *) buffer + len, "Test passed. Buffer changed and original contents preserved.\n"); + } + break; + } + + last_ptr = ptr; + ptr = ptr->next; + } + } + + if (!last_ptr) + len += sprintf((char *) buffer + len, "All available CBW buffers on cpu %d used.\n", i); + } + + if (!found) + len += sprintf((char *) buffer + len, "Copy before write buffer not found.\n"); + + toi_free_cbw_data(); + + return len; +} + +/* + * This array contains entries that are automatically registered at + * boot. Modules and the console code register their own entries separately. + */ +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_CUSTOM("test", SYSFS_RW, toi_cbw_test_read, + NULL, SYSFS_NEEDS_SM_FOR_READ, NULL), +}; + +static struct toi_module_ops toi_cbw_ops = { + .type = MISC_HIDDEN_MODULE, + .name = "copy_before_write debugging", + .directory = "cbw", + .module = THIS_MODULE, + .early = 1, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +int toi_cbw_init(void) +{ + int result = toi_register_module(&toi_cbw_ops); + return result; +} diff --git b/kernel/power/tuxonice_extent.c b/kernel/power/tuxonice_extent.c new file mode 100644 index 0000000..522c836 --- /dev/null +++ b/kernel/power/tuxonice_extent.c @@ -0,0 +1,144 @@ +/* + * kernel/power/tuxonice_extent.c + * + * Copyright (C) 2003-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * Distributed under GPLv2. + * + * These functions encapsulate the manipulation of storage metadata. + */ + +#include +#include "tuxonice_modules.h" +#include "tuxonice_extent.h" +#include "tuxonice_alloc.h" +#include "tuxonice_ui.h" +#include "tuxonice.h" + +/** + * toi_get_extent - return a free extent + * + * May fail, returning NULL instead. + **/ +static struct hibernate_extent *toi_get_extent(void) +{ + return (struct hibernate_extent *) toi_kzalloc(2, + sizeof(struct hibernate_extent), TOI_ATOMIC_GFP); +} + +/** + * toi_put_extent_chain - free a chain of extents starting from value 'from' + * @chain: Chain to free. + * + * Note that 'from' is an extent value, and may be part way through an extent. + * In this case, the extent should be truncated (if necessary) and following + * extents freed. + **/ +void toi_put_extent_chain_from(struct hibernate_extent_chain *chain, unsigned long from) +{ + struct hibernate_extent *this; + + this = chain->first; + + while (this) { + struct hibernate_extent *next = this->next; + + // Delete the whole extent? + if (this->start >= from) { + chain->size -= (this->end - this->start + 1); + if (chain->first == this) + chain->first = next; + if (chain->last_touched == this) + chain->last_touched = NULL; + if (chain->current_extent == this) + chain->current_extent = NULL; + toi_kfree(2, this, sizeof(*this)); + chain->num_extents--; + } else if (this->end >= from) { + // Delete part of the extent + chain->size -= (this->end - from + 1); + this->start = from; + } + this = next; + } +} + +/** + * toi_put_extent_chain - free a whole chain of extents + * @chain: Chain to free. + **/ +void toi_put_extent_chain(struct hibernate_extent_chain *chain) +{ + toi_put_extent_chain_from(chain, 0); +} + +/** + * toi_add_to_extent_chain - add an extent to an existing chain + * @chain: Chain to which the extend should be added + * @start: Start of the extent (first physical block) + * @end: End of the extent (last physical block) + * + * The chain information is updated if the insertion is successful. + **/ +int toi_add_to_extent_chain(struct hibernate_extent_chain *chain, + unsigned long start, unsigned long end) +{ + struct hibernate_extent *new_ext = NULL, *cur_ext = NULL; + + toi_message(TOI_IO, TOI_VERBOSE, 0, + "Adding extent %lu-%lu to chain %p.\n", start, end, chain); + + /* Find the right place in the chain */ + if (chain->last_touched && chain->last_touched->start < start) + cur_ext = chain->last_touched; + else if (chain->first && chain->first->start < start) + cur_ext = chain->first; + + if (cur_ext) { + while (cur_ext->next && cur_ext->next->start < start) + cur_ext = cur_ext->next; + + if (cur_ext->end == (start - 1)) { + struct hibernate_extent *next_ext = cur_ext->next; + cur_ext->end = end; + + /* Merge with the following one? */ + if (next_ext && cur_ext->end + 1 == next_ext->start) { + cur_ext->end = next_ext->end; + cur_ext->next = next_ext->next; + toi_kfree(2, next_ext, sizeof(*next_ext)); + chain->num_extents--; + } + + chain->last_touched = cur_ext; + chain->size += (end - start + 1); + + return 0; + } + } + + new_ext = toi_get_extent(); + if (!new_ext) { + printk(KERN_INFO "Error unable to append a new extent to the " + "chain.\n"); + return -ENOMEM; + } + + chain->num_extents++; + chain->size += (end - start + 1); + new_ext->start = start; + new_ext->end = end; + + chain->last_touched = new_ext; + + if (cur_ext) { + new_ext->next = cur_ext->next; + cur_ext->next = new_ext; + } else { + if (chain->first) + new_ext->next = chain->first; + chain->first = new_ext; + } + + return 0; +} diff --git b/kernel/power/tuxonice_extent.h b/kernel/power/tuxonice_extent.h new file mode 100644 index 0000000..aeccf1f --- /dev/null +++ b/kernel/power/tuxonice_extent.h @@ -0,0 +1,45 @@ +/* + * kernel/power/tuxonice_extent.h + * + * Copyright (C) 2003-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * It contains declarations related to extents. Extents are + * TuxOnIce's method of storing some of the metadata for the image. + * See tuxonice_extent.c for more info. + * + */ + +#include "tuxonice_modules.h" + +#ifndef EXTENT_H +#define EXTENT_H + +struct hibernate_extent { + unsigned long start, end; + struct hibernate_extent *next; +}; + +struct hibernate_extent_chain { + unsigned long size; /* size of the chain ie sum (max-min+1) */ + int num_extents; + struct hibernate_extent *first, *last_touched; + struct hibernate_extent *current_extent; + unsigned long current_offset; +}; + +/* Simplify iterating through all the values in an extent chain */ +#define toi_extent_for_each(extent_chain, extentpointer, value) \ +if ((extent_chain)->first) \ + for ((extentpointer) = (extent_chain)->first, (value) = \ + (extentpointer)->start; \ + ((extentpointer) && ((extentpointer)->next || (value) <= \ + (extentpointer)->end)); \ + (((value) == (extentpointer)->end) ? \ + ((extentpointer) = (extentpointer)->next, (value) = \ + ((extentpointer) ? (extentpointer)->start : 0)) : \ + (value)++)) + +extern void toi_put_extent_chain_from(struct hibernate_extent_chain *chain, unsigned long from); +#endif diff --git b/kernel/power/tuxonice_file.c b/kernel/power/tuxonice_file.c new file mode 100644 index 0000000..baf1912 --- /dev/null +++ b/kernel/power/tuxonice_file.c @@ -0,0 +1,484 @@ +/* + * kernel/power/tuxonice_file.c + * + * Copyright (C) 2005-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * Distributed under GPLv2. + * + * This file encapsulates functions for usage of a simple file as a + * backing store. It is based upon the swapallocator, and shares the + * same basic working. Here, though, we have nothing to do with + * swapspace, and only one device to worry about. + * + * The user can just + * + * echo TuxOnIce > /path/to/my_file + * + * dd if=/dev/zero bs=1M count= >> /path/to/my_file + * + * and + * + * echo /path/to/my_file > /sys/power/tuxonice/file/target + * + * then put what they find in /sys/power/tuxonice/resume + * as their resume= parameter in lilo.conf (and rerun lilo if using it). + * + * Having done this, they're ready to hibernate and resume. + * + * TODO: + * - File resizing. + */ + +#include +#include +#include +#include + +#include "tuxonice.h" +#include "tuxonice_modules.h" +#include "tuxonice_bio.h" +#include "tuxonice_alloc.h" +#include "tuxonice_builtin.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_ui.h" +#include "tuxonice_io.h" + +#define target_is_normal_file() (S_ISREG(target_inode->i_mode)) + +static struct toi_module_ops toi_fileops; + +static struct file *target_file; +static struct block_device *toi_file_target_bdev; +static unsigned long pages_available, pages_allocated; +static char toi_file_target[256]; +static struct inode *target_inode; +static int file_target_priority; +static int used_devt; +static int target_claim; +static dev_t toi_file_dev_t; +static int sig_page_index; + +/* For test_toi_file_target */ +static struct toi_bdev_info *file_chain; + +static int has_contiguous_blocks(struct toi_bdev_info *dev_info, int page_num) +{ + int j; + sector_t last = 0; + + for (j = 0; j < dev_info->blocks_per_page; j++) { + sector_t this = bmap(target_inode, + page_num * dev_info->blocks_per_page + j); + + if (!this || (last && (last + 1) != this)) + break; + + last = this; + } + + return j == dev_info->blocks_per_page; +} + +static unsigned long get_usable_pages(struct toi_bdev_info *dev_info) +{ + unsigned long result = 0; + struct block_device *bdev = dev_info->bdev; + int i; + + switch (target_inode->i_mode & S_IFMT) { + case S_IFSOCK: + case S_IFCHR: + case S_IFIFO: /* Socket, Char, Fifo */ + return -1; + case S_IFREG: /* Regular file: current size - holes + free + space on part */ + for (i = 0; i < (target_inode->i_size >> PAGE_SHIFT) ; i++) { + if (has_contiguous_blocks(dev_info, i)) + result++; + } + break; + case S_IFBLK: /* Block device */ + if (!bdev->bd_disk) { + toi_message(TOI_IO, TOI_VERBOSE, 0, + "bdev->bd_disk null."); + return 0; + } + + result = (bdev->bd_part ? + bdev->bd_part->nr_sects : + get_capacity(bdev->bd_disk)) >> (PAGE_SHIFT - 9); + } + + + return result; +} + +static int toi_file_register_storage(void) +{ + struct toi_bdev_info *devinfo; + int result = 0; + struct fs_info *fs_info; + + toi_message(TOI_IO, TOI_VERBOSE, 0, "toi_file_register_storage."); + if (!strlen(toi_file_target)) { + toi_message(TOI_IO, TOI_VERBOSE, 0, "Register file storage: " + "No target filename set."); + return 0; + } + + target_file = filp_open(toi_file_target, O_RDONLY|O_LARGEFILE, 0); + toi_message(TOI_IO, TOI_VERBOSE, 0, "filp_open %s returned %p.", + toi_file_target, target_file); + + if (IS_ERR(target_file) || !target_file) { + target_file = NULL; + toi_file_dev_t = name_to_dev_t(toi_file_target); + if (!toi_file_dev_t) { + struct kstat stat; + int error = vfs_stat(toi_file_target, &stat); + printk(KERN_INFO "Open file %s returned %p and " + "name_to_devt failed.\n", + toi_file_target, target_file); + if (error) { + printk(KERN_INFO "Stating the file also failed." + " Nothing more we can do.\n"); + return 0; + } else + toi_file_dev_t = stat.rdev; + } + + toi_file_target_bdev = toi_open_by_devnum(toi_file_dev_t); + if (IS_ERR(toi_file_target_bdev)) { + printk(KERN_INFO "Got a dev_num (%lx) but failed to " + "open it.\n", + (unsigned long) toi_file_dev_t); + toi_file_target_bdev = NULL; + return 0; + } + used_devt = 1; + target_inode = toi_file_target_bdev->bd_inode; + } else + target_inode = target_file->f_mapping->host; + + toi_message(TOI_IO, TOI_VERBOSE, 0, "Succeeded in opening the target."); + if (S_ISLNK(target_inode->i_mode) || S_ISDIR(target_inode->i_mode) || + S_ISSOCK(target_inode->i_mode) || S_ISFIFO(target_inode->i_mode)) { + printk(KERN_INFO "File support works with regular files," + " character files and block devices.\n"); + /* Cleanup routine will undo the above */ + return 0; + } + + if (!used_devt) { + if (S_ISBLK(target_inode->i_mode)) { + toi_file_target_bdev = I_BDEV(target_inode); + if (!blkdev_get(toi_file_target_bdev, FMODE_WRITE | + FMODE_READ, NULL)) + target_claim = 1; + } else + toi_file_target_bdev = target_inode->i_sb->s_bdev; + if (!toi_file_target_bdev) { + printk(KERN_INFO "%s is not a valid file allocator " + "target.\n", toi_file_target); + return 0; + } + toi_file_dev_t = toi_file_target_bdev->bd_dev; + } + + devinfo = toi_kzalloc(39, sizeof(struct toi_bdev_info), GFP_ATOMIC); + if (!devinfo) { + printk("Failed to allocate a toi_bdev_info struct for the file allocator.\n"); + return -ENOMEM; + } + + devinfo->bdev = toi_file_target_bdev; + devinfo->allocator = &toi_fileops; + devinfo->allocator_index = 0; + + fs_info = fs_info_from_block_dev(toi_file_target_bdev); + if (fs_info && !IS_ERR(fs_info)) { + memcpy(devinfo->uuid, &fs_info->uuid, 16); + free_fs_info(fs_info); + } else + result = (int) PTR_ERR(fs_info); + + /* Unlike swap code, only complain if fs_info_from_block_dev returned + * -ENOMEM. The 'file' might be a full partition, so might validly not + * have an identifiable type, UUID etc. + */ + if (result) + printk(KERN_DEBUG "Failed to get fs_info for file device (%d).\n", + result); + devinfo->dev_t = toi_file_dev_t; + devinfo->prio = file_target_priority; + devinfo->bmap_shift = target_inode->i_blkbits - 9; + devinfo->blocks_per_page = + (1 << (PAGE_SHIFT - target_inode->i_blkbits)); + sprintf(devinfo->name, "file %s", toi_file_target); + file_chain = devinfo; + toi_message(TOI_IO, TOI_VERBOSE, 0, "Dev_t is %lx. Prio is %d. Bmap " + "shift is %d. Blocks per page %d.", + devinfo->dev_t, devinfo->prio, devinfo->bmap_shift, + devinfo->blocks_per_page); + + /* Keep one aside for the signature */ + pages_available = get_usable_pages(devinfo) - 1; + + toi_message(TOI_IO, TOI_VERBOSE, 0, "Registering file storage, %lu " + "pages.", pages_available); + + toi_bio_ops.register_storage(devinfo); + return 0; +} + +static unsigned long toi_file_storage_available(void) +{ + return pages_available; +} + +static int toi_file_allocate_storage(struct toi_bdev_info *chain, + unsigned long request) +{ + unsigned long available = pages_available - pages_allocated; + unsigned long to_add = min(available, request); + + toi_message(TOI_IO, TOI_VERBOSE, 0, "Pages available is %lu. Allocated " + "is %lu. Allocating %lu pages from file.", + pages_available, pages_allocated, to_add); + pages_allocated += to_add; + + return to_add; +} + +/** + * __populate_block_list - add an extent to the chain + * @min: Start of the extent (first physical block = sector) + * @max: End of the extent (last physical block = sector) + * + * If TOI_TEST_BIO is set, print a debug message, outputting the min and max + * fs block numbers. + **/ +static int __populate_block_list(struct toi_bdev_info *chain, int min, int max) +{ + if (test_action_state(TOI_TEST_BIO)) + toi_message(TOI_IO, TOI_VERBOSE, 0, "Adding extent %d-%d.", + min << chain->bmap_shift, + ((max + 1) << chain->bmap_shift) - 1); + + return toi_add_to_extent_chain(&chain->blocks, min, max); +} + +static int get_main_pool_phys_params(struct toi_bdev_info *chain) +{ + int i, extent_min = -1, extent_max = -1, result = 0, have_sig_page = 0; + unsigned long pages_mapped = 0; + + toi_message(TOI_IO, TOI_VERBOSE, 0, "Getting file allocator blocks."); + + if (chain->blocks.first) + toi_put_extent_chain(&chain->blocks); + + if (!target_is_normal_file()) { + result = (pages_available > 0) ? + __populate_block_list(chain, chain->blocks_per_page, + (pages_allocated + 1) * + chain->blocks_per_page - 1) : 0; + return result; + } + + /* + * FIXME: We are assuming the first page is contiguous. Is that + * assumption always right? + */ + + for (i = 0; i < (target_inode->i_size >> PAGE_SHIFT); i++) { + sector_t new_sector; + + if (!has_contiguous_blocks(chain, i)) + continue; + + if (!have_sig_page) { + have_sig_page = 1; + sig_page_index = i; + continue; + } + + pages_mapped++; + + /* Ignore first page - it has the header */ + if (pages_mapped == 1) + continue; + + new_sector = bmap(target_inode, (i * chain->blocks_per_page)); + + /* + * I'd love to be able to fill in holes and resize + * files, but not yet... + */ + + if (new_sector == extent_max + 1) + extent_max += chain->blocks_per_page; + else { + if (extent_min > -1) { + result = __populate_block_list(chain, + extent_min, extent_max); + if (result) + return result; + } + + extent_min = new_sector; + extent_max = extent_min + + chain->blocks_per_page - 1; + } + + if (pages_mapped == pages_allocated) + break; + } + + if (extent_min > -1) { + result = __populate_block_list(chain, extent_min, extent_max); + if (result) + return result; + } + + return 0; +} + +static void toi_file_free_storage(struct toi_bdev_info *chain) +{ + pages_allocated = 0; + file_chain = NULL; +} + +/** + * toi_file_print_debug_stats - print debug info + * @buffer: Buffer to data to populate + * @size: Size of the buffer + **/ +static int toi_file_print_debug_stats(char *buffer, int size) +{ + int len = scnprintf(buffer, size, "- File Allocator active.\n"); + + len += scnprintf(buffer+len, size-len, " Storage available for " + "image: %lu pages.\n", pages_available); + + return len; +} + +static void toi_file_cleanup(int finishing_cycle) +{ + if (toi_file_target_bdev) { + if (target_claim) { + blkdev_put(toi_file_target_bdev, FMODE_WRITE | FMODE_READ); + target_claim = 0; + } + + if (used_devt) { + blkdev_put(toi_file_target_bdev, + FMODE_READ | FMODE_NDELAY); + used_devt = 0; + } + toi_file_target_bdev = NULL; + target_inode = NULL; + } + + if (target_file) { + filp_close(target_file, NULL); + target_file = NULL; + } + + pages_available = 0; +} + +/** + * test_toi_file_target - sysfs callback for /sys/power/tuxonince/file/target + * + * Test wheter the target file is valid for hibernating. + **/ +static void test_toi_file_target(void) +{ + int result = toi_file_register_storage(); + sector_t sector; + char buf[50]; + struct fs_info *fs_info; + + if (result || !file_chain) + return; + + /* This doesn't mean we're in business. Is any storage available? */ + if (!pages_available) + goto out; + + toi_file_allocate_storage(file_chain, 1); + result = get_main_pool_phys_params(file_chain); + if (result) + goto out; + + + sector = bmap(target_inode, sig_page_index * + file_chain->blocks_per_page) << file_chain->bmap_shift; + + /* Use the uuid, or the dev_t if that fails */ + fs_info = fs_info_from_block_dev(toi_file_target_bdev); + if (!fs_info || IS_ERR(fs_info)) { + bdevname(toi_file_target_bdev, buf); + sprintf(resume_file, "/dev/%s:%llu", buf, + (unsigned long long) sector); + } else { + int i; + hex_dump_to_buffer(fs_info->uuid, 16, 32, 1, buf, 50, 0); + + /* Remove the spaces */ + for (i = 1; i < 16; i++) { + buf[2 * i] = buf[3 * i]; + buf[2 * i + 1] = buf[3 * i + 1]; + } + buf[32] = 0; + sprintf(resume_file, "UUID=%s:0x%llx", buf, + (unsigned long long) sector); + free_fs_info(fs_info); + } + + toi_attempt_to_parse_resume_device(0); +out: + toi_file_free_storage(file_chain); + toi_bio_ops.free_storage(); +} + +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_STRING("target", SYSFS_RW, toi_file_target, 256, + SYSFS_NEEDS_SM_FOR_WRITE, test_toi_file_target), + SYSFS_INT("enabled", SYSFS_RW, &toi_fileops.enabled, 0, 1, 0, NULL), + SYSFS_INT("priority", SYSFS_RW, &file_target_priority, -4095, + 4096, 0, NULL), +}; + +static struct toi_bio_allocator_ops toi_bio_fileops = { + .register_storage = toi_file_register_storage, + .storage_available = toi_file_storage_available, + .allocate_storage = toi_file_allocate_storage, + .bmap = get_main_pool_phys_params, + .free_storage = toi_file_free_storage, +}; + +static struct toi_module_ops toi_fileops = { + .type = BIO_ALLOCATOR_MODULE, + .name = "file storage", + .directory = "file", + .module = THIS_MODULE, + .print_debug_info = toi_file_print_debug_stats, + .cleanup = toi_file_cleanup, + .bio_allocator_ops = &toi_bio_fileops, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +/* ---- Registration ---- */ +static __init int toi_file_load(void) +{ + return toi_register_module(&toi_fileops); +} + +late_initcall(toi_file_load); diff --git b/kernel/power/tuxonice_highlevel.c b/kernel/power/tuxonice_highlevel.c new file mode 100644 index 0000000..16cf14c --- /dev/null +++ b/kernel/power/tuxonice_highlevel.c @@ -0,0 +1,1413 @@ +/* + * kernel/power/tuxonice_highlevel.c + */ +/** \mainpage TuxOnIce. + * + * TuxOnIce provides support for saving and restoring an image of + * system memory to an arbitrary storage device, either on the local computer, + * or across some network. The support is entirely OS based, so TuxOnIce + * works without requiring BIOS, APM or ACPI support. The vast majority of the + * code is also architecture independant, so it should be very easy to port + * the code to new architectures. TuxOnIce includes support for SMP, 4G HighMem + * and preemption. Initramfses and initrds are also supported. + * + * TuxOnIce uses a modular design, in which the method of storing the image is + * completely abstracted from the core code, as are transformations on the data + * such as compression and/or encryption (multiple 'modules' can be used to + * provide arbitrary combinations of functionality). The user interface is also + * modular, so that arbitrarily simple or complex interfaces can be used to + * provide anything from debugging information through to eye candy. + * + * \section Copyright + * + * TuxOnIce is released under the GPLv2. + * + * Copyright (C) 1998-2001 Gabor Kuti
+ * Copyright (C) 1998,2001,2002 Pavel Machek
+ * Copyright (C) 2002-2003 Florent Chabaud
+ * Copyright (C) 2002-2015 Nigel Cunningham (nigel at nigelcunningham com au)
+ * + * \section Credits + * + * Nigel would like to thank the following people for their work: + * + * Bernard Blackham
+ * Web page & Wiki administration, some coding. A person without whom + * TuxOnIce would not be where it is. + * + * Michael Frank
+ * Extensive testing and help with improving stability. I was constantly + * amazed by the quality and quantity of Michael's help. + * + * Pavel Machek
+ * Modifications, defectiveness pointing, being with Gabor at the very + * beginning, suspend to swap space, stop all tasks. Port to 2.4.18-ac and + * 2.5.17. Even though Pavel and I disagree on the direction suspend to + * disk should take, I appreciate the valuable work he did in helping Gabor + * get the concept working. + * + * ..and of course the myriads of TuxOnIce users who have helped diagnose + * and fix bugs, made suggestions on how to improve the code, proofread + * documentation, and donated time and money. + * + * Thanks also to corporate sponsors: + * + * Redhat.Sometime employer from May 2006 (my fault, not Redhat's!). + * + * Cyclades.com. Nigel's employers from Dec 2004 until May 2006, who + * allowed him to work on TuxOnIce and PM related issues on company time. + * + * LinuxFund.org. Sponsored Nigel's work on TuxOnIce for four months Oct + * 2003 to Jan 2004. + * + * LAC Linux. Donated P4 hardware that enabled development and ongoing + * maintenance of SMP and Highmem support. + * + * OSDL. Provided access to various hardware configurations, make + * occasional small donations to the project. + */ + +#include +#include +#include +#include +#include +#include +#include +#include /* for get/set_fs & KERNEL_DS on i386 */ +#include +#include + +#include "tuxonice.h" +#include "tuxonice_modules.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_prepare_image.h" +#include "tuxonice_io.h" +#include "tuxonice_ui.h" +#include "tuxonice_power_off.h" +#include "tuxonice_storage.h" +#include "tuxonice_checksum.h" +#include "tuxonice_builtin.h" +#include "tuxonice_atomic_copy.h" +#include "tuxonice_alloc.h" +#include "tuxonice_cluster.h" + +/*! Pageset metadata. */ +struct pagedir pagedir2 = {2}; + +static mm_segment_t oldfs; +static DEFINE_MUTEX(tuxonice_in_use); +static int block_dump_save; + +int toi_trace_index; + +/* Binary signature if an image is present */ +char tuxonice_signature[9] = "\xed\xc3\x02\xe9\x98\x56\xe5\x0c"; + +unsigned long boot_kernel_data_buffer; + +static char *result_strings[] = { + "Hibernation was aborted", + "The user requested that we cancel the hibernation", + "No storage was available", + "Insufficient storage was available", + "Freezing filesystems and/or tasks failed", + "A pre-existing image was used", + "We would free memory, but image size limit doesn't allow this", + "Unable to free enough memory to hibernate", + "Unable to obtain the Power Management Semaphore", + "A device suspend/resume returned an error", + "A system device suspend/resume returned an error", + "The extra pages allowance is too small", + "We were unable to successfully prepare an image", + "TuxOnIce module initialisation failed", + "TuxOnIce module cleanup failed", + "I/O errors were encountered", + "Ran out of memory", + "An error was encountered while reading the image", + "Platform preparation failed", + "CPU Hotplugging failed", + "Architecture specific preparation failed", + "Pages needed resaving, but we were told to abort if this happens", + "We can't hibernate at the moment (invalid resume= or filewriter " + "target?)", + "A hibernation preparation notifier chain member cancelled the " + "hibernation", + "Pre-snapshot preparation failed", + "Pre-restore preparation failed", + "Failed to disable usermode helpers", + "Can't resume from alternate image", + "Header reservation too small", + "Device Power Management Preparation failed", +}; + +/** + * toi_finish_anything - cleanup after doing anything + * @hibernate_or_resume: Whether finishing a cycle or attempt at + * resuming. + * + * This is our basic clean-up routine, matching start_anything below. We + * call cleanup routines, drop module references and restore process fs and + * cpus allowed masks, together with the global block_dump variable's value. + **/ +void toi_finish_anything(int hibernate_or_resume) +{ + toi_running = 0; + toi_cleanup_modules(hibernate_or_resume); + toi_put_modules(); + if (hibernate_or_resume) { + block_dump = block_dump_save; + set_cpus_allowed_ptr(current, cpu_all_mask); + toi_alloc_print_debug_stats(); + atomic_inc(&snapshot_device_available); + unlock_system_sleep(); + } + + set_fs(oldfs); + mutex_unlock(&tuxonice_in_use); +} + +/** + * toi_start_anything - basic initialisation for TuxOnIce + * @toi_or_resume: Whether starting a cycle or attempt at resuming. + * + * Our basic initialisation routine. Take references on modules, use the + * kernel segment, recheck resume= if no active allocator is set, initialise + * modules, save and reset block_dump and ensure we're running on CPU0. + **/ +int toi_start_anything(int hibernate_or_resume) +{ + mutex_lock(&tuxonice_in_use); + + oldfs = get_fs(); + set_fs(KERNEL_DS); + + toi_trace_index = 0; + + if (hibernate_or_resume) { + lock_system_sleep(); + + if (!atomic_add_unless(&snapshot_device_available, -1, 0)) + goto snapshotdevice_unavailable; + } + + if (hibernate_or_resume == SYSFS_HIBERNATE) + toi_print_modules(); + + if (toi_get_modules()) { + printk(KERN_INFO "TuxOnIce: Get modules failed!\n"); + goto prehibernate_err; + } + + if (hibernate_or_resume) { + block_dump_save = block_dump; + block_dump = 0; + set_cpus_allowed_ptr(current, + cpumask_of(cpumask_first(cpu_online_mask))); + } + + if (toi_initialise_modules_early(hibernate_or_resume)) + goto early_init_err; + + if (!toiActiveAllocator) + toi_attempt_to_parse_resume_device(!hibernate_or_resume); + + if (!toi_initialise_modules_late(hibernate_or_resume)) { + toi_running = 1; /* For the swsusp code we use :< */ + return 0; + } + + toi_cleanup_modules(hibernate_or_resume); +early_init_err: + if (hibernate_or_resume) { + block_dump_save = block_dump; + set_cpus_allowed_ptr(current, cpu_all_mask); + } + toi_put_modules(); +prehibernate_err: + if (hibernate_or_resume) + atomic_inc(&snapshot_device_available); +snapshotdevice_unavailable: + if (hibernate_or_resume) + mutex_unlock(&pm_mutex); + set_fs(oldfs); + mutex_unlock(&tuxonice_in_use); + return -EBUSY; +} + +/* + * Nosave page tracking. + * + * Here rather than in prepare_image because we want to do it once only at the + * start of a cycle. + */ + +/** + * mark_nosave_pages - set up our Nosave bitmap + * + * Build a bitmap of Nosave pages from the list. The bitmap allows faster + * use when preparing the image. + **/ +static void mark_nosave_pages(void) +{ + struct nosave_region *region; + + list_for_each_entry(region, &nosave_regions, list) { + unsigned long pfn; + + for (pfn = region->start_pfn; pfn < region->end_pfn; pfn++) + if (pfn_valid(pfn)) { + SetPageNosave(pfn_to_page(pfn)); + } + } +} + +/** + * allocate_bitmaps - allocate bitmaps used to record page states + * + * Allocate the bitmaps we use to record the various TuxOnIce related + * page states. + **/ +static int allocate_bitmaps(void) +{ + if (toi_alloc_bitmap(&pageset1_map) || + toi_alloc_bitmap(&pageset1_copy_map) || + toi_alloc_bitmap(&pageset2_map) || + toi_alloc_bitmap(&io_map) || + toi_alloc_bitmap(&nosave_map) || + toi_alloc_bitmap(&free_map) || + toi_alloc_bitmap(&compare_map) || + toi_alloc_bitmap(&page_resave_map)) + return 1; + + return 0; +} + +/** + * free_bitmaps - free the bitmaps used to record page states + * + * Free the bitmaps allocated above. It is not an error to call + * memory_bm_free on a bitmap that isn't currently allocated. + **/ +static void free_bitmaps(void) +{ + toi_free_bitmap(&pageset1_map); + toi_free_bitmap(&pageset1_copy_map); + toi_free_bitmap(&pageset2_map); + toi_free_bitmap(&io_map); + toi_free_bitmap(&nosave_map); + toi_free_bitmap(&free_map); + toi_free_bitmap(&compare_map); + toi_free_bitmap(&page_resave_map); +} + +/** + * io_MB_per_second - return the number of MB/s read or written + * @write: Whether to return the speed at which we wrote. + * + * Calculate the number of megabytes per second that were read or written. + **/ +static int io_MB_per_second(int write) +{ + return (toi_bkd.toi_io_time[write][1]) ? + MB((unsigned long) toi_bkd.toi_io_time[write][0]) * HZ / + toi_bkd.toi_io_time[write][1] : 0; +} + +#define SNPRINTF(a...) do { len += scnprintf(((char *) buffer) + len, \ + count - len - 1, ## a); } while (0) + +/** + * get_debug_info - fill a buffer with debugging information + * @buffer: The buffer to be filled. + * @count: The size of the buffer, in bytes. + * + * Fill a (usually PAGE_SIZEd) buffer with the debugging info that we will + * either printk or return via sysfs. + **/ +static int get_toi_debug_info(const char *buffer, int count) +{ + int len = 0, i, first_result = 1; + + SNPRINTF("TuxOnIce debugging info:\n"); + SNPRINTF("- TuxOnIce core : " TOI_CORE_VERSION "\n"); + SNPRINTF("- Kernel Version : " UTS_RELEASE "\n"); + SNPRINTF("- Compiler vers. : %d.%d\n", __GNUC__, __GNUC_MINOR__); + SNPRINTF("- Attempt number : %d\n", nr_hibernates); + SNPRINTF("- Parameters : %ld %ld %ld %d %ld %ld\n", + toi_result, + toi_bkd.toi_action, + toi_bkd.toi_debug_state, + toi_bkd.toi_default_console_level, + image_size_limit, + toi_poweroff_method); + SNPRINTF("- Overall expected compression percentage: %d.\n", + 100 - toi_expected_compression_ratio()); + len += toi_print_module_debug_info(((char *) buffer) + len, + count - len - 1); + if (toi_bkd.toi_io_time[0][1]) { + if ((io_MB_per_second(0) < 5) || (io_MB_per_second(1) < 5)) { + SNPRINTF("- I/O speed: Write %ld KB/s", + (KB((unsigned long) toi_bkd.toi_io_time[0][0]) * HZ / + toi_bkd.toi_io_time[0][1])); + if (toi_bkd.toi_io_time[1][1]) + SNPRINTF(", Read %ld KB/s", + (KB((unsigned long) + toi_bkd.toi_io_time[1][0]) * HZ / + toi_bkd.toi_io_time[1][1])); + } else { + SNPRINTF("- I/O speed: Write %ld MB/s", + (MB((unsigned long) toi_bkd.toi_io_time[0][0]) * HZ / + toi_bkd.toi_io_time[0][1])); + if (toi_bkd.toi_io_time[1][1]) + SNPRINTF(", Read %ld MB/s", + (MB((unsigned long) + toi_bkd.toi_io_time[1][0]) * HZ / + toi_bkd.toi_io_time[1][1])); + } + SNPRINTF(".\n"); + } else + SNPRINTF("- No I/O speed stats available.\n"); + SNPRINTF("- Extra pages : %lu used/%lu.\n", + extra_pd1_pages_used, extra_pd1_pages_allowance); + + for (i = 0; i < TOI_NUM_RESULT_STATES; i++) + if (test_result_state(i)) { + SNPRINTF("%s: %s.\n", first_result ? + "- Result " : + " ", + result_strings[i]); + first_result = 0; + } + if (first_result) + SNPRINTF("- Result : %s.\n", nr_hibernates ? + "Succeeded" : + "No hibernation attempts so far"); + return len; +} + +#ifdef CONFIG_TOI_INCREMENTAL +/** + * get_toi_page_state - fill a buffer with page state information + * @buffer: The buffer to be filled. + * @count: The size of the buffer, in bytes. + * + * Fill a (usually PAGE_SIZEd) buffer with the debugging info that we will + * either printk or return via sysfs. + **/ +static int get_toi_page_state(const char *buffer, int count) +{ + int free = 0, untracked = 0, dirty = 0, ro = 0, invalid = 0, other = 0, total = 0; + int len = 0; + struct zone *zone; + int allocated_bitmaps = 0; + + set_cpus_allowed_ptr(current, + cpumask_of(cpumask_first(cpu_online_mask))); + + if (!free_map) { + BUG_ON(toi_alloc_bitmap(&free_map)); + allocated_bitmaps = 1; + } + + toi_generate_free_page_map(); + + for_each_populated_zone(zone) { + unsigned long loop; + + total += zone->spanned_pages; + + for (loop = 0; loop < zone->spanned_pages; loop++) { + unsigned long pfn = zone->zone_start_pfn + loop; + struct page *page; + int chunk_size; + + if (!pfn_valid(pfn)) { + continue; + } + + chunk_size = toi_size_of_free_region(zone, pfn); + if (chunk_size) { + /* + * If the page gets allocated, it will be need + * saving in an image. + * Don't bother with explicitly removing any + * RO protection applied below. + * We'll SetPageTOI_Dirty(page) if/when it + * gets allocated. + */ + free += chunk_size; + loop += chunk_size - 1; + continue; + } + + page = pfn_to_page(pfn); + + if (PageTOI_Untracked(page)) { + untracked++; + } else if (PageTOI_RO(page)) { + ro++; + } else if (PageTOI_Dirty(page)) { + dirty++; + } else { + printk("Page %ld state 'other'.\n", pfn); + other++; + } + } + } + + if (allocated_bitmaps) { + toi_free_bitmap(&free_map); + } + + set_cpus_allowed_ptr(current, cpu_all_mask); + + SNPRINTF("TuxOnIce page breakdown:\n"); + SNPRINTF("- Free : %d\n", free); + SNPRINTF("- Untracked : %d\n", untracked); + SNPRINTF("- Read only : %d\n", ro); + SNPRINTF("- Dirty : %d\n", dirty); + SNPRINTF("- Other : %d\n", other); + SNPRINTF("- Invalid : %d\n", invalid); + SNPRINTF("- Total : %d\n", total); + return len; +} +#endif + +/** + * do_cleanup - cleanup after attempting to hibernate or resume + * @get_debug_info: Whether to allocate and return debugging info. + * + * Cleanup after attempting to hibernate or resume, possibly getting + * debugging info as we do so. + **/ +static void do_cleanup(int get_debug_info, int restarting) +{ + int i = 0; + char *buffer = NULL; + + trap_non_toi_io = 0; + + if (get_debug_info) + toi_prepare_status(DONT_CLEAR_BAR, "Cleaning up..."); + + free_checksum_pages(); + + toi_cbw_restore(); + toi_free_cbw_data(); + + if (get_debug_info) + buffer = (char *) toi_get_zeroed_page(20, TOI_ATOMIC_GFP); + + if (buffer) + i = get_toi_debug_info(buffer, PAGE_SIZE); + + toi_free_extra_pagedir_memory(); + + pagedir1.size = 0; + pagedir2.size = 0; + set_highmem_size(pagedir1, 0); + set_highmem_size(pagedir2, 0); + + if (boot_kernel_data_buffer) { + if (!test_toi_state(TOI_BOOT_KERNEL)) + toi_free_page(37, boot_kernel_data_buffer); + boot_kernel_data_buffer = 0; + } + + if (test_toi_state(TOI_DEVICE_HOTPLUG_LOCKED)) { + unlock_device_hotplug(); + clear_toi_state(TOI_DEVICE_HOTPLUG_LOCKED); + } + + clear_toi_state(TOI_BOOT_KERNEL); + if (current->flags & PF_SUSPEND_TASK) + thaw_processes(); + + if (!restarting) + toi_stop_other_threads(); + + if (toi_keeping_image && + !test_result_state(TOI_ABORTED)) { + toi_message(TOI_ANY_SECTION, TOI_LOW, 1, + "TuxOnIce: Not invalidating the image due " + "to Keep Image or Incremental Image being enabled."); + set_result_state(TOI_KEPT_IMAGE); + + /* + * For an incremental image, free unused storage so + * swap (if any) can be used for normal system operation, + * if so desired. + */ + + toiActiveAllocator->free_unused_storage(); + } else + if (toiActiveAllocator) + toiActiveAllocator->remove_image(); + + free_bitmaps(); + usermodehelper_enable(); + + if (test_toi_state(TOI_NOTIFIERS_PREPARE)) { + pm_notifier_call_chain(PM_POST_HIBERNATION); + clear_toi_state(TOI_NOTIFIERS_PREPARE); + } + + if (buffer && i) { + /* Printk can only handle 1023 bytes, including + * its level mangling. */ + for (i = 0; i < 3; i++) + printk(KERN_ERR "%s", buffer + (1023 * i)); + toi_free_page(20, (unsigned long) buffer); + } + + if (!restarting) + toi_cleanup_console(); + + free_attention_list(); + + if (!restarting) + toi_deactivate_storage(0); + + clear_toi_state(TOI_IGNORE_LOGLEVEL); + clear_toi_state(TOI_TRYING_TO_RESUME); + clear_toi_state(TOI_NOW_RESUMING); +} + +/** + * check_still_keeping_image - we kept an image; check whether to reuse it. + * + * We enter this routine when we have kept an image. If the user has said they + * want to still keep it, all we need to do is powerdown. If powering down + * means hibernating to ram and the power doesn't run out, we'll return 1. + * If we do power off properly or the battery runs out, we'll resume via the + * normal paths. + * + * If the user has said they want to remove the previously kept image, we + * remove it, and return 0. We'll then store a new image. + **/ +static int check_still_keeping_image(void) +{ + if (toi_keeping_image) { + if (!test_action_state(TOI_INCREMENTAL_IMAGE)) { + printk(KERN_INFO "Image already stored: powering down " + "immediately."); + do_toi_step(STEP_HIBERNATE_POWERDOWN); + return 1; + } + /** + * Incremental image - need to write new part. + * We detect that we're writing an incremental image by looking + * at test_result_state(TOI_KEPT_IMAGE) + **/ + return 0; + } + + printk(KERN_INFO "Invalidating previous image.\n"); + toiActiveAllocator->remove_image(); + + return 0; +} + +/** + * toi_init - prepare to hibernate to disk + * + * Initialise variables & data structures, in preparation for + * hibernating to disk. + **/ +static int toi_init(int restarting) +{ + int result, i, j; + + toi_result = 0; + + printk(KERN_INFO "Initiating a hibernation cycle.\n"); + + nr_hibernates++; + + for (i = 0; i < 2; i++) + for (j = 0; j < 2; j++) + toi_bkd.toi_io_time[i][j] = 0; + + if (!test_toi_state(TOI_CAN_HIBERNATE) || + allocate_bitmaps()) + return 1; + + mark_nosave_pages(); + + if (!restarting) + toi_prepare_console(); + + result = pm_notifier_call_chain(PM_HIBERNATION_PREPARE); + if (result) { + set_result_state(TOI_NOTIFIERS_PREPARE_FAILED); + return 1; + } + set_toi_state(TOI_NOTIFIERS_PREPARE); + + if (!restarting) { + printk(KERN_ERR "Starting other threads."); + toi_start_other_threads(); + } + + result = usermodehelper_disable(); + if (result) { + printk(KERN_ERR "TuxOnIce: Failed to disable usermode " + "helpers\n"); + set_result_state(TOI_USERMODE_HELPERS_ERR); + return 1; + } + + boot_kernel_data_buffer = toi_get_zeroed_page(37, TOI_ATOMIC_GFP); + if (!boot_kernel_data_buffer) { + printk(KERN_ERR "TuxOnIce: Failed to allocate " + "boot_kernel_data_buffer.\n"); + set_result_state(TOI_OUT_OF_MEMORY); + return 1; + } + + toi_allocate_cbw_data(); + + return 0; +} + +/** + * can_hibernate - perform basic 'Can we hibernate?' tests + * + * Perform basic tests that must pass if we're going to be able to hibernate: + * Can we get the pm_mutex? Is resume= valid (we need to know where to write + * the image header). + **/ +static int can_hibernate(void) +{ + if (!test_toi_state(TOI_CAN_HIBERNATE)) + toi_attempt_to_parse_resume_device(0); + + if (!test_toi_state(TOI_CAN_HIBERNATE)) { + printk(KERN_INFO "TuxOnIce: Hibernation is disabled.\n" + "This may be because you haven't put something along " + "the lines of\n\nresume=swap:/dev/hda1\n\n" + "in lilo.conf or equivalent. (Where /dev/hda1 is your " + "swap partition).\n"); + set_abort_result(TOI_CANT_SUSPEND); + return 0; + } + + if (strlen(alt_resume_param)) { + attempt_to_parse_alt_resume_param(); + + if (!strlen(alt_resume_param)) { + printk(KERN_INFO "Alternate resume parameter now " + "invalid. Aborting.\n"); + set_abort_result(TOI_CANT_USE_ALT_RESUME); + return 0; + } + } + + return 1; +} + +/** + * do_post_image_write - having written an image, figure out what to do next + * + * After writing an image, we might load an alternate image or power down. + * Powering down might involve hibernating to ram, in which case we also + * need to handle reloading pageset2. + **/ +static int do_post_image_write(void) +{ + /* If switching images fails, do normal powerdown */ + if (alt_resume_param[0]) + do_toi_step(STEP_RESUME_ALT_IMAGE); + + toi_power_down(); + + barrier(); + mb(); + return 0; +} + +/** + * __save_image - do the hard work of saving the image + * + * High level routine for getting the image saved. The key assumptions made + * are that processes have been frozen and sufficient memory is available. + * + * We also exit through here at resume time, coming back from toi_hibernate + * after the atomic restore. This is the reason for the toi_in_hibernate + * test. + **/ +static int __save_image(void) +{ + int temp_result, did_copy = 0; + + toi_prepare_status(DONT_CLEAR_BAR, "Starting to save the image.."); + + toi_message(TOI_ANY_SECTION, TOI_LOW, 1, + " - Final values: %d and %d.", + pagedir1.size, pagedir2.size); + + toi_cond_pause(1, "About to write pagedir2."); + + temp_result = write_pageset(&pagedir2); + + if (temp_result == -1 || test_result_state(TOI_ABORTED)) + return 1; + + toi_cond_pause(1, "About to copy pageset 1."); + + if (test_result_state(TOI_ABORTED)) + return 1; + + toi_deactivate_storage(1); + + toi_prepare_status(DONT_CLEAR_BAR, "Doing atomic copy/restore."); + + toi_in_hibernate = 1; + + if (toi_go_atomic(PMSG_FREEZE, 1)) + goto Failed; + + temp_result = toi_hibernate(); + +#ifdef CONFIG_KGDB + if (test_action_state(TOI_POST_RESUME_BREAKPOINT)) + kgdb_breakpoint(); +#endif + + if (!temp_result) + did_copy = 1; + + /* We return here at resume time too! */ + toi_end_atomic(ATOMIC_ALL_STEPS, toi_in_hibernate, temp_result); + +Failed: + if (toi_activate_storage(1)) + panic("Failed to reactivate our storage."); + + /* Resume time? */ + if (!toi_in_hibernate) { + copyback_post(); + return 0; + } + + /* Nope. Hibernating. So, see if we can save the image... */ + + if (temp_result || test_result_state(TOI_ABORTED)) { + if (did_copy) + goto abort_reloading_pagedir_two; + else + return 1; + } + + toi_update_status(pagedir2.size, pagedir1.size + pagedir2.size, + NULL); + + if (test_result_state(TOI_ABORTED)) + goto abort_reloading_pagedir_two; + + toi_cond_pause(1, "About to write pageset1."); + + toi_message(TOI_ANY_SECTION, TOI_LOW, 1, "-- Writing pageset1"); + + temp_result = write_pageset(&pagedir1); + + /* We didn't overwrite any memory, so no reread needs to be done. */ + if (test_action_state(TOI_TEST_FILTER_SPEED) || + test_action_state(TOI_TEST_BIO)) + return 1; + + if (temp_result == 1 || test_result_state(TOI_ABORTED)) + goto abort_reloading_pagedir_two; + + toi_cond_pause(1, "About to write header."); + + if (test_result_state(TOI_ABORTED)) + goto abort_reloading_pagedir_two; + + temp_result = write_image_header(); + + if (!temp_result && !test_result_state(TOI_ABORTED)) + return 0; + +abort_reloading_pagedir_two: + temp_result = read_pageset2(1); + + /* If that failed, we're sunk. Panic! */ + if (temp_result) + panic("Attempt to reload pagedir 2 while aborting " + "a hibernate failed."); + + return 1; +} + +static void map_ps2_pages(int enable) +{ + unsigned long pfn = 0; + + memory_bm_position_reset(pageset2_map); + pfn = memory_bm_next_pfn(pageset2_map, 0); + + while (pfn != BM_END_OF_MAP) { + struct page *page = pfn_to_page(pfn); + kernel_map_pages(page, 1, enable); + pfn = memory_bm_next_pfn(pageset2_map, 0); + } +} + +/** + * do_save_image - save the image and handle the result + * + * Save the prepared image. If we fail or we're in the path returning + * from the atomic restore, cleanup. + **/ +static int do_save_image(void) +{ + int result; + map_ps2_pages(0); + result = __save_image(); + map_ps2_pages(1); + return result; +} + +/** + * do_prepare_image - try to prepare an image + * + * Seek to initialise and prepare an image to be saved. On failure, + * cleanup. + **/ +static int do_prepare_image(void) +{ + int restarting = test_result_state(TOI_EXTRA_PAGES_ALLOW_TOO_SMALL); + + if (!restarting && toi_activate_storage(0)) + return 1; + + /* + * If kept image and still keeping image and hibernating to RAM, (non + * incremental image case) we will return 1 after hibernating and + * resuming (provided the power doesn't run out. In that case, we skip + * directly to cleaning up and exiting. + */ + + if (!can_hibernate() || + (test_result_state(TOI_KEPT_IMAGE) && + check_still_keeping_image())) + return 1; + + if (toi_init(restarting) || toi_prepare_image() || + test_result_state(TOI_ABORTED)) + return 1; + + trap_non_toi_io = 1; + + return 0; +} + +/** + * do_check_can_resume - find out whether an image has been stored + * + * Read whether an image exists. We use the same routine as the + * image_exists sysfs entry, and just look to see whether the + * first character in the resulting buffer is a '1'. + **/ +int do_check_can_resume(void) +{ + int result = -1; + + if (toi_activate_storage(0)) + return -1; + + if (!test_toi_state(TOI_RESUME_DEVICE_OK)) + toi_attempt_to_parse_resume_device(1); + + if (toiActiveAllocator) + result = toiActiveAllocator->image_exists(1); + + toi_deactivate_storage(0); + return result; +} + +/** + * do_load_atomic_copy - load the first part of an image, if it exists + * + * Check whether we have an image. If one exists, do sanity checking + * (possibly invalidating the image or even rebooting if the user + * requests that) before loading it into memory in preparation for the + * atomic restore. + * + * If and only if we have an image loaded and ready to restore, we return 1. + **/ +static int do_load_atomic_copy(void) +{ + int read_image_result = 0; + + if (sizeof(swp_entry_t) != sizeof(long)) { + printk(KERN_WARNING "TuxOnIce: The size of swp_entry_t != size" + " of long. Please report this!\n"); + return 1; + } + + if (!resume_file[0]) + printk(KERN_WARNING "TuxOnIce: " + "You need to use a resume= command line parameter to " + "tell TuxOnIce where to look for an image.\n"); + + toi_activate_storage(0); + + if (!(test_toi_state(TOI_RESUME_DEVICE_OK)) && + !toi_attempt_to_parse_resume_device(0)) { + /* + * Without a usable storage device we can do nothing - + * even if noresume is given + */ + + if (!toiNumAllocators) + printk(KERN_ALERT "TuxOnIce: " + "No storage allocators have been registered.\n"); + else + printk(KERN_ALERT "TuxOnIce: " + "Missing or invalid storage location " + "(resume= parameter). Please correct and " + "rerun lilo (or equivalent) before " + "hibernating.\n"); + toi_deactivate_storage(0); + return 1; + } + + if (allocate_bitmaps()) + return 1; + + read_image_result = read_pageset1(); /* non fatal error ignored */ + + if (test_toi_state(TOI_NORESUME_SPECIFIED)) + clear_toi_state(TOI_NORESUME_SPECIFIED); + + toi_deactivate_storage(0); + + if (read_image_result) + return 1; + + return 0; +} + +/** + * prepare_restore_load_alt_image - save & restore alt image variables + * + * Save and restore the pageset1 maps, when loading an alternate image. + **/ +static void prepare_restore_load_alt_image(int prepare) +{ + static struct memory_bitmap *pageset1_map_save, *pageset1_copy_map_save; + + if (prepare) { + pageset1_map_save = pageset1_map; + pageset1_map = NULL; + pageset1_copy_map_save = pageset1_copy_map; + pageset1_copy_map = NULL; + set_toi_state(TOI_LOADING_ALT_IMAGE); + toi_reset_alt_image_pageset2_pfn(); + } else { + toi_free_bitmap(&pageset1_map); + pageset1_map = pageset1_map_save; + toi_free_bitmap(&pageset1_copy_map); + pageset1_copy_map = pageset1_copy_map_save; + clear_toi_state(TOI_NOW_RESUMING); + clear_toi_state(TOI_LOADING_ALT_IMAGE); + } +} + +/** + * do_toi_step - perform a step in hibernating or resuming + * + * Perform a step in hibernating or resuming an image. This abstraction + * is in preparation for implementing cluster support, and perhaps replacing + * uswsusp too (haven't looked whether that's possible yet). + **/ +int do_toi_step(int step) +{ + switch (step) { + case STEP_HIBERNATE_PREPARE_IMAGE: + return do_prepare_image(); + case STEP_HIBERNATE_SAVE_IMAGE: + return do_save_image(); + case STEP_HIBERNATE_POWERDOWN: + return do_post_image_write(); + case STEP_RESUME_CAN_RESUME: + return do_check_can_resume(); + case STEP_RESUME_LOAD_PS1: + return do_load_atomic_copy(); + case STEP_RESUME_DO_RESTORE: + /* + * If we succeed, this doesn't return. + * Instead, we return from do_save_image() in the + * hibernated kernel. + */ + return toi_atomic_restore(); + case STEP_RESUME_ALT_IMAGE: + printk(KERN_INFO "Trying to resume alternate image.\n"); + toi_in_hibernate = 0; + save_restore_alt_param(SAVE, NOQUIET); + prepare_restore_load_alt_image(1); + if (!do_check_can_resume()) { + printk(KERN_INFO "Nothing to resume from.\n"); + goto out; + } + if (!do_load_atomic_copy()) + toi_atomic_restore(); + + printk(KERN_INFO "Failed to load image.\n"); +out: + prepare_restore_load_alt_image(0); + save_restore_alt_param(RESTORE, NOQUIET); + break; + case STEP_CLEANUP: + do_cleanup(1, 0); + break; + case STEP_QUIET_CLEANUP: + do_cleanup(0, 0); + break; + } + + return 0; +} + +/* -- Functions for kickstarting a hibernate or resume --- */ + +/** + * toi_try_resume - try to do the steps in resuming + * + * Check if we have an image and if so try to resume. Clear the status + * flags too. + **/ +void toi_try_resume(void) +{ + set_toi_state(TOI_TRYING_TO_RESUME); + resume_attempted = 1; + + current->flags |= PF_MEMALLOC; + toi_start_other_threads(); + + if (do_toi_step(STEP_RESUME_CAN_RESUME) && + !do_toi_step(STEP_RESUME_LOAD_PS1)) + do_toi_step(STEP_RESUME_DO_RESTORE); + + toi_stop_other_threads(); + do_cleanup(0, 0); + + current->flags &= ~PF_MEMALLOC; + + clear_toi_state(TOI_IGNORE_LOGLEVEL); + clear_toi_state(TOI_TRYING_TO_RESUME); + clear_toi_state(TOI_NOW_RESUMING); +} + +/** + * toi_sys_power_disk_try_resume - wrapper calling toi_try_resume + * + * Wrapper for when __toi_try_resume is called from swsusp resume path, + * rather than from echo > /sys/power/tuxonice/do_resume. + **/ +static void toi_sys_power_disk_try_resume(void) +{ + resume_attempted = 1; + + /* + * There's a comment in kernel/power/disk.c that indicates + * we should be able to use mutex_lock_nested below. That + * doesn't seem to cut it, though, so let's just turn lockdep + * off for now. + */ + lockdep_off(); + + if (toi_start_anything(SYSFS_RESUMING)) + goto out; + + toi_try_resume(); + + /* + * For initramfs, we have to clear the boot time + * flag after trying to resume + */ + clear_toi_state(TOI_BOOT_TIME); + + toi_finish_anything(SYSFS_RESUMING); +out: + lockdep_on(); +} + +/** + * toi_try_hibernate - try to start a hibernation cycle + * + * Start a hibernation cycle, coming in from either + * echo > /sys/power/tuxonice/do_suspend + * + * or + * + * echo disk > /sys/power/state + * + * In the later case, we come in without pm_sem taken; in the + * former, it has been taken. + **/ +int toi_try_hibernate(void) +{ + int result = 0, sys_power_disk = 0, retries = 0; + + if (!mutex_is_locked(&tuxonice_in_use)) { + /* Came in via /sys/power/disk */ + if (toi_start_anything(SYSFS_HIBERNATING)) + return -EBUSY; + sys_power_disk = 1; + } + + current->flags |= PF_MEMALLOC; + + if (test_toi_state(TOI_CLUSTER_MODE)) { + toi_initiate_cluster_hibernate(); + goto out; + } + +prepare: + result = do_toi_step(STEP_HIBERNATE_PREPARE_IMAGE); + + if (result) + goto out; + + if (test_action_state(TOI_FREEZER_TEST)) + goto out_restore_gfp_mask; + + result = do_toi_step(STEP_HIBERNATE_SAVE_IMAGE); + + if (test_result_state(TOI_EXTRA_PAGES_ALLOW_TOO_SMALL)) { + if (retries < 2) { + do_cleanup(0, 1); + retries++; + clear_result_state(TOI_ABORTED); + extra_pd1_pages_allowance = extra_pd1_pages_used + 500; + printk(KERN_INFO "Automatically adjusting the extra" + " pages allowance to %ld and restarting.\n", + extra_pd1_pages_allowance); + pm_restore_gfp_mask(); + goto prepare; + } + + printk(KERN_INFO "Adjusted extra pages allowance twice and " + "still couldn't hibernate successfully. Giving up."); + } + + /* This code runs at resume time too! */ + if (!result && toi_in_hibernate) + result = do_toi_step(STEP_HIBERNATE_POWERDOWN); + +out_restore_gfp_mask: + pm_restore_gfp_mask(); +out: + do_cleanup(1, 0); + current->flags &= ~PF_MEMALLOC; + + if (sys_power_disk) + toi_finish_anything(SYSFS_HIBERNATING); + + return result; +} + +/* + * channel_no: If !0, -c is added to args (userui). + */ +int toi_launch_userspace_program(char *command, int channel_no, + int wait, int debug) +{ + int retval; + static char *envp[] = { + "HOME=/", + "TERM=linux", + "PATH=/sbin:/usr/sbin:/bin:/usr/bin", + NULL }; + static char *argv[] = { NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL + }; + char *channel = NULL; + int arg = 0, size; + char test_read[255]; + char *orig_posn = command; + + if (!strlen(orig_posn)) + return 1; + + if (channel_no) { + channel = toi_kzalloc(4, 6, GFP_KERNEL); + if (!channel) { + printk(KERN_INFO "Failed to allocate memory in " + "preparing to launch userspace program.\n"); + return 1; + } + } + + /* Up to 6 args supported */ + while (arg < 6) { + sscanf(orig_posn, "%s", test_read); + size = strlen(test_read); + if (!(size)) + break; + argv[arg] = toi_kzalloc(5, size + 1, TOI_ATOMIC_GFP); + strcpy(argv[arg], test_read); + orig_posn += size + 1; + *test_read = 0; + arg++; + } + + if (channel_no) { + sprintf(channel, "-c%d", channel_no); + argv[arg] = channel; + } else + arg--; + + if (debug) { + argv[++arg] = toi_kzalloc(5, 8, TOI_ATOMIC_GFP); + strcpy(argv[arg], "--debug"); + } + + retval = call_usermodehelper(argv[0], argv, envp, wait); + + /* + * If the program reports an error, retval = 256. Don't complain + * about that here. + */ + if (retval && retval != 256) + printk(KERN_ERR "Failed to launch userspace program '%s': " + "Error %d\n", command, retval); + + { + int i; + for (i = 0; i < arg; i++) + if (argv[i] && argv[i] != channel) + toi_kfree(5, argv[i], sizeof(*argv[i])); + } + + toi_kfree(4, channel, sizeof(*channel)); + + return retval; +} + +/* + * This array contains entries that are automatically registered at + * boot. Modules and the console code register their own entries separately. + */ +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_LONG("extra_pages_allowance", SYSFS_RW, + &extra_pd1_pages_allowance, 0, LONG_MAX, 0), + SYSFS_CUSTOM("image_exists", SYSFS_RW, image_exists_read, + image_exists_write, SYSFS_NEEDS_SM_FOR_BOTH, NULL), + SYSFS_STRING("resume", SYSFS_RW, resume_file, 255, + SYSFS_NEEDS_SM_FOR_WRITE, + attempt_to_parse_resume_device2), + SYSFS_STRING("alt_resume_param", SYSFS_RW, alt_resume_param, 255, + SYSFS_NEEDS_SM_FOR_WRITE, + attempt_to_parse_alt_resume_param), + SYSFS_CUSTOM("debug_info", SYSFS_READONLY, get_toi_debug_info, NULL, 0, + NULL), + SYSFS_BIT("ignore_rootfs", SYSFS_RW, &toi_bkd.toi_action, + TOI_IGNORE_ROOTFS, 0), + SYSFS_LONG("image_size_limit", SYSFS_RW, &image_size_limit, -2, + INT_MAX, 0), + SYSFS_UL("last_result", SYSFS_RW, &toi_result, 0, 0, 0), + SYSFS_BIT("no_multithreaded_io", SYSFS_RW, &toi_bkd.toi_action, + TOI_NO_MULTITHREADED_IO, 0), + SYSFS_BIT("no_flusher_thread", SYSFS_RW, &toi_bkd.toi_action, + TOI_NO_FLUSHER_THREAD, 0), + SYSFS_BIT("full_pageset2", SYSFS_RW, &toi_bkd.toi_action, + TOI_PAGESET2_FULL, 0), + SYSFS_BIT("reboot", SYSFS_RW, &toi_bkd.toi_action, TOI_REBOOT, 0), + SYSFS_BIT("replace_swsusp", SYSFS_RW, &toi_bkd.toi_action, + TOI_REPLACE_SWSUSP, 0), + SYSFS_STRING("resume_commandline", SYSFS_RW, + toi_bkd.toi_nosave_commandline, COMMAND_LINE_SIZE, 0, + NULL), + SYSFS_STRING("version", SYSFS_READONLY, TOI_CORE_VERSION, 0, 0, NULL), + SYSFS_BIT("freezer_test", SYSFS_RW, &toi_bkd.toi_action, + TOI_FREEZER_TEST, 0), + SYSFS_BIT("test_bio", SYSFS_RW, &toi_bkd.toi_action, TOI_TEST_BIO, 0), + SYSFS_BIT("test_filter_speed", SYSFS_RW, &toi_bkd.toi_action, + TOI_TEST_FILTER_SPEED, 0), + SYSFS_BIT("no_pageset2", SYSFS_RW, &toi_bkd.toi_action, + TOI_NO_PAGESET2, 0), + SYSFS_BIT("no_pageset2_if_unneeded", SYSFS_RW, &toi_bkd.toi_action, + TOI_NO_PS2_IF_UNNEEDED, 0), + SYSFS_STRING("binary_signature", SYSFS_READONLY, + tuxonice_signature, 9, 0, NULL), + SYSFS_INT("max_workers", SYSFS_RW, &toi_max_workers, 0, NR_CPUS, 0, + NULL), +#ifdef CONFIG_KGDB + SYSFS_BIT("post_resume_breakpoint", SYSFS_RW, &toi_bkd.toi_action, + TOI_POST_RESUME_BREAKPOINT, 0), +#endif + SYSFS_BIT("no_readahead", SYSFS_RW, &toi_bkd.toi_action, + TOI_NO_READAHEAD, 0), + SYSFS_BIT("trace_debug_on", SYSFS_RW, &toi_bkd.toi_action, + TOI_TRACE_DEBUG_ON, 0), +#ifdef CONFIG_TOI_KEEP_IMAGE + SYSFS_BIT("keep_image", SYSFS_RW , &toi_bkd.toi_action, TOI_KEEP_IMAGE, + 0), +#endif +#ifdef CONFIG_TOI_INCREMENTAL + SYSFS_CUSTOM("pagestate", SYSFS_READONLY, get_toi_page_state, NULL, 0, + NULL), + SYSFS_BIT("incremental", SYSFS_RW, &toi_bkd.toi_action, + TOI_INCREMENTAL_IMAGE, 1), +#endif +}; + +static struct toi_core_fns my_fns = { + .get_nonconflicting_page = __toi_get_nonconflicting_page, + .post_context_save = __toi_post_context_save, + .try_hibernate = toi_try_hibernate, + .try_resume = toi_sys_power_disk_try_resume, +}; + +/** + * core_load - initialisation of TuxOnIce core + * + * Initialise the core, beginning with sysfs. Checksum and so on are part of + * the core, but have their own initialisation routines because they either + * aren't compiled in all the time or have their own subdirectories. + **/ +static __init int core_load(void) +{ + int i, + numfiles = sizeof(sysfs_params) / sizeof(struct toi_sysfs_data); + + printk(KERN_INFO "TuxOnIce " TOI_CORE_VERSION + " (http://tuxonice.net)\n"); + + if (!hibernation_available()) { + printk(KERN_INFO "TuxOnIce disabled due to request for hibernation" + " to be disabled in this kernel.\n"); + return 1; + } + + if (toi_sysfs_init()) + return 1; + + for (i = 0; i < numfiles; i++) + toi_register_sysfs_file(tuxonice_kobj, &sysfs_params[i]); + + toi_core_fns = &my_fns; + + if (toi_alloc_init()) + return 1; + if (toi_checksum_init()) + return 1; + if (toi_usm_init()) + return 1; + if (toi_ui_init()) + return 1; + if (toi_poweroff_init()) + return 1; + if (toi_cluster_init()) + return 1; + if (toi_cbw_init()) + return 1; + + return 0; +} + +late_initcall(core_load); diff --git b/kernel/power/tuxonice_incremental.c b/kernel/power/tuxonice_incremental.c new file mode 100644 index 0000000..a8c5f36 --- /dev/null +++ b/kernel/power/tuxonice_incremental.c @@ -0,0 +1,402 @@ +/* + * kernel/power/tuxonice_incremental.c + * + * Copyright (C) 2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * This file contains routines related to storing incremental images - that + * is, retaining an image after an initial cycle and then storing incremental + * changes on subsequent hibernations. + * + * Based in part on on... + * + * Debug helper to dump the current kernel pagetables of the system + * so that we can see what the various memory ranges are set to. + * + * (C) Copyright 2008 Intel Corporation + * + * Author: Arjan van de Ven + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; version 2 + * of the License. + */ + +#include +#include +#include +#include +#include +#include +#include +#include "tuxonice_pageflags.h" +#include "tuxonice_builtin.h" +#include "power.h" + +int toi_do_incremental_initcall; + +extern void kdb_init(int level); +extern noinline void kgdb_breakpoint(void); + +#undef pr_debug +#if 0 +#define pr_debug(a, b...) do { printk(a, ##b); } while(0) +#else +#define pr_debug(a, b...) do { } while(0) +#endif + +/* Multipliers for offsets within the PTEs */ +#define PTE_LEVEL_MULT (PAGE_SIZE) +#define PMD_LEVEL_MULT (PTRS_PER_PTE * PTE_LEVEL_MULT) +#define PUD_LEVEL_MULT (PTRS_PER_PMD * PMD_LEVEL_MULT) +#define PGD_LEVEL_MULT (PTRS_PER_PUD * PUD_LEVEL_MULT) + +/* + * This function gets called on a break in a continuous series + * of PTE entries; the next one is different so we need to + * print what we collected so far. + */ +static void note_page(void *addr) +{ + static struct page *lastpage; + struct page *page; + + page = virt_to_page(addr); + + if (page != lastpage) { + unsigned int level; + pte_t *pte = lookup_address((unsigned long) addr, &level); + struct page *pt_page2 = pte_page(*pte); + //debug("Note page %p (=> %p => %p|%ld).\n", addr, pte, pt_page2, page_to_pfn(pt_page2)); + SetPageTOI_Untracked(pt_page2); + lastpage = page; + } +} + +static void walk_pte_level(pmd_t addr) +{ + int i; + pte_t *start; + + start = (pte_t *) pmd_page_vaddr(addr); + for (i = 0; i < PTRS_PER_PTE; i++) { + note_page(start); + start++; + } +} + +#if PTRS_PER_PMD > 1 + +static void walk_pmd_level(pud_t addr) +{ + int i; + pmd_t *start; + + start = (pmd_t *) pud_page_vaddr(addr); + for (i = 0; i < PTRS_PER_PMD; i++) { + if (!pmd_none(*start)) { + if (pmd_large(*start) || !pmd_present(*start)) + note_page(start); + else + walk_pte_level(*start); + } else + note_page(start); + start++; + } +} + +#else +#define walk_pmd_level(a) walk_pte_level(__pmd(pud_val(a))) +#define pud_large(a) pmd_large(__pmd(pud_val(a))) +#define pud_none(a) pmd_none(__pmd(pud_val(a))) +#endif + +#if PTRS_PER_PUD > 1 + +static void walk_pud_level(pgd_t addr) +{ + int i; + pud_t *start; + + start = (pud_t *) pgd_page_vaddr(addr); + + for (i = 0; i < PTRS_PER_PUD; i++) { + if (!pud_none(*start)) { + if (pud_large(*start) || !pud_present(*start)) + note_page(start); + else + walk_pmd_level(*start); + } else + note_page(start); + + start++; + } +} + +#else +#define walk_pud_level(a) walk_pmd_level(__pud(pgd_val(a))) +#define pgd_large(a) pud_large(__pud(pgd_val(a))) +#define pgd_none(a) pud_none(__pud(pgd_val(a))) +#endif + +/* + * Not static in the original at the time of writing, so needs renaming here. + */ +static void toi_ptdump_walk_pgd_level(pgd_t *pgd) +{ +#ifdef CONFIG_X86_64 + pgd_t *start = (pgd_t *) &init_level4_pgt; +#else + pgd_t *start = swapper_pg_dir; +#endif + int i; + if (pgd) { + start = pgd; + } + + for (i = 0; i < PTRS_PER_PGD; i++) { + if (!pgd_none(*start)) { + if (pgd_large(*start) || !pgd_present(*start)) + note_page(start); + else + walk_pud_level(*start); + } else + note_page(start); + + start++; + } + + /* Flush out the last page */ + note_page(start); +} + +#ifdef CONFIG_PARAVIRT +extern struct pv_info pv_info; + +static void toi_set_paravirt_ops_untracked(void) { + int i; + + unsigned long pvpfn = page_to_pfn(virt_to_page(__parainstructions)), + pvpfn_end = page_to_pfn(virt_to_page(__parainstructions_end)); + //debug(KERN_EMERG ".parainstructions goes from pfn %ld to %ld.\n", pvpfn, pvpfn_end); + for (i = pvpfn; i <= pvpfn_end; i++) { + SetPageTOI_Untracked(pfn_to_page(i)); + } +} +#else +#define toi_set_paravirt_ops_untracked() { do { } while(0) } +#endif + +extern void toi_mark_per_cpus_pages_untracked(void); + +void toi_untrack_stack(unsigned long *stack) +{ + int i; + struct page *stack_page = virt_to_page(stack); + + for (i = 0; i < (1 << THREAD_SIZE_ORDER); i++) { + pr_debug("Untrack stack page %p.\n", page_address(stack_page + i)); + SetPageTOI_Untracked(stack_page + i); + } +} +void toi_untrack_process(struct task_struct *p) +{ + SetPageTOI_Untracked(virt_to_page(p)); + pr_debug("Untrack process %d page %p.\n", p->pid, page_address(virt_to_page(p))); + + toi_untrack_stack(p->stack); +} + +void toi_generate_untracked_map(void) +{ + struct task_struct *p, *t; + struct page *page; + pte_t *pte; + int i; + unsigned int level; + static int been_here = 0; + + if (been_here) + return; + + been_here = 1; + + /* Pagetable pages */ + toi_ptdump_walk_pgd_level(NULL); + + /* Printk buffer - not normally needed but can be helpful for debugging. */ + //toi_set_logbuf_untracked(); + + /* Paravirt ops */ + toi_set_paravirt_ops_untracked(); + + /* Task structs and stacks */ + for_each_process_thread(p, t) { + toi_untrack_process(p); + //toi_untrack_stack((unsigned long *) t->thread.sp); + } + + for (i = 0; i < NR_CPUS; i++) { + struct task_struct *idle = idle_task(i); + + if (idle) { + pr_debug("Untrack idle process for CPU %d.\n", i); + toi_untrack_process(idle); + } + + /* IRQ stack */ + pr_debug("Untrack IRQ stack for CPU %d.\n", i); + toi_untrack_stack((unsigned long *)per_cpu(irq_stack_ptr, i)); + } + + /* Per CPU data */ + //pr_debug("Untracking per CPU variable pages.\n"); + toi_mark_per_cpus_pages_untracked(); + + /* Init stack - for bringing up secondary CPUs */ + page = virt_to_page(init_stack); + for (i = 0; i < DIV_ROUND_UP(sizeof(init_stack), PAGE_SIZE); i++) { + SetPageTOI_Untracked(page + i); + } + + pte = lookup_address((unsigned long) &mmu_cr4_features, &level); + SetPageTOI_Untracked(pte_page(*pte)); + SetPageTOI_Untracked(virt_to_page(trampoline_cr4_features)); +} + +/** + * toi_reset_dirtiness_one + */ + +void toi_reset_dirtiness_one(unsigned long pfn, int verbose) +{ + struct page *page = pfn_to_page(pfn); + + /** + * Don't worry about whether the Dirty flag is + * already set. If this is our first call, it + * won't be. + */ + + preempt_disable(); + + ClearPageTOI_Dirty(page); + SetPageTOI_RO(page); + if (verbose) + printk(KERN_EMERG "Making page %ld (%p|%p) read only.\n", pfn, page, page_address(page)); + + set_memory_ro((unsigned long) page_address(page), 1); + + preempt_enable(); +} + +/** + * TuxOnIce's incremental image support works by marking all memory apart from + * the page tables read-only, then in the page-faults that result enabling + * writing if appropriate and flagging the page as dirty. Free pages are also + * marked as dirty and not protected so that if allocated, they will be included + * in the image without further processing. + * + * toi_reset_dirtiness is called when and image exists and incremental images are + * enabled, and each time we resume thereafter. It is not invoked on a fresh boot. + * + * This routine should be called from a single-cpu-running context to avoid races in setting + * page dirty/read only flags. + * + * TODO: Make "it is not invoked on a fresh boot" true when I've finished developing it! + * + * TODO: Consider Xen paravirt guest boot issues. See arch/x86/mm/pageattr.c. + **/ + +int toi_reset_dirtiness(int verbose) +{ + struct zone *zone; + unsigned long loop; + int allocated_map = 0; + + toi_generate_untracked_map(); + + if (!free_map) { + if (!toi_alloc_bitmap(&free_map)) + return -ENOMEM; + allocated_map = 1; + } + + toi_generate_free_page_map(); + + pr_debug(KERN_EMERG "Reset dirtiness.\n"); + for_each_populated_zone(zone) { + // 64 bit only. No need to worry about highmem. + for (loop = 0; loop < zone->spanned_pages; loop++) { + unsigned long pfn = zone->zone_start_pfn + loop; + struct page *page; + int chunk_size; + + if (!pfn_valid(pfn)) { + continue; + } + + chunk_size = toi_size_of_free_region(zone, pfn); + if (chunk_size) { + loop += chunk_size - 1; + continue; + } + + page = pfn_to_page(pfn); + + if (PageNosave(page) || !saveable_page(zone, pfn)) { + continue; + } + + if (PageTOI_Untracked(page)) { + continue; + } + + /** + * Do we need to (re)protect the page? + * If it is already protected (PageTOI_RO), there is + * nothing to do - skip the following. + * If it is marked as dirty (PageTOI_Dirty), it was + * either free and has been allocated or has been + * written to and marked dirty. Reset the dirty flag + * and (re)apply the protection. + */ + if (!PageTOI_RO(page)) { + toi_reset_dirtiness_one(pfn, verbose); + } + } + } + + pr_debug(KERN_EMERG "Done resetting dirtiness.\n"); + + if (allocated_map) { + toi_free_bitmap(&free_map); + } + return 0; +} + +static int toi_reset_dirtiness_initcall(void) +{ + if (toi_do_incremental_initcall) { + pr_info("TuxOnIce: Enabling dirty page tracking.\n"); + toi_reset_dirtiness(0); + } + return 1; +} +extern void toi_generate_untracked_map(void); + +// Leave early_initcall for pages to register untracked sections. +early_initcall(toi_reset_dirtiness_initcall); + +static int __init toi_incremental_initcall_setup(char *str) +{ + int value; + + if (sscanf(str, "=%d", &value) && value) + toi_do_incremental_initcall = value; + + return 1; +} +__setup("toi_incremental_initcall", toi_incremental_initcall_setup); diff --git b/kernel/power/tuxonice_io.c b/kernel/power/tuxonice_io.c new file mode 100644 index 0000000..3c62c26 --- /dev/null +++ b/kernel/power/tuxonice_io.c @@ -0,0 +1,1932 @@ +/* + * kernel/power/tuxonice_io.c + * + * Copyright (C) 1998-2001 Gabor Kuti + * Copyright (C) 1998,2001,2002 Pavel Machek + * Copyright (C) 2002-2003 Florent Chabaud + * Copyright (C) 2002-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * It contains high level IO routines for hibernating. + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "tuxonice.h" +#include "tuxonice_modules.h" +#include "tuxonice_pageflags.h" +#include "tuxonice_io.h" +#include "tuxonice_ui.h" +#include "tuxonice_storage.h" +#include "tuxonice_prepare_image.h" +#include "tuxonice_extent.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_builtin.h" +#include "tuxonice_checksum.h" +#include "tuxonice_alloc.h" +char alt_resume_param[256]; + +/* Version read from image header at resume */ +static int toi_image_header_version; + +#define read_if_version(VERS, VAR, DESC, ERR_ACT) do { \ + if (likely(toi_image_header_version >= VERS)) \ + if (toiActiveAllocator->rw_header_chunk(READ, NULL, \ + (char *) &VAR, sizeof(VAR))) { \ + abort_hibernate(TOI_FAILED_IO, "Failed to read DESC."); \ + ERR_ACT; \ + } \ +} while(0) \ + +/* Variables shared between threads and updated under the mutex */ +static int io_write, io_finish_at, io_base, io_barmax, io_pageset, io_result; +static int io_index, io_nextupdate, io_pc, io_pc_step; +static DEFINE_MUTEX(io_mutex); +static DEFINE_PER_CPU(struct page *, last_sought); +static DEFINE_PER_CPU(struct page *, last_high_page); +static DEFINE_PER_CPU(char *, checksum_locn); +static DEFINE_PER_CPU(struct pbe *, last_low_page); +static atomic_t io_count; +atomic_t toi_io_workers; + +static int using_flusher; + +DECLARE_WAIT_QUEUE_HEAD(toi_io_queue_flusher); + +int toi_bio_queue_flusher_should_finish; + +int toi_max_workers; + +static char *image_version_error = "The image header version is newer than " \ + "this kernel supports."; + +struct toi_module_ops *first_filter; + +static atomic_t toi_num_other_threads; +static DECLARE_WAIT_QUEUE_HEAD(toi_worker_wait_queue); +enum toi_worker_commands { + TOI_IO_WORKER_STOP, + TOI_IO_WORKER_RUN, + TOI_IO_WORKER_EXIT +}; +static enum toi_worker_commands toi_worker_command; + +/** + * toi_attempt_to_parse_resume_device - determine if we can hibernate + * + * Can we hibernate, using the current resume= parameter? + **/ +int toi_attempt_to_parse_resume_device(int quiet) +{ + struct list_head *Allocator; + struct toi_module_ops *thisAllocator; + int result, returning = 0; + + if (toi_activate_storage(0)) + return 0; + + toiActiveAllocator = NULL; + clear_toi_state(TOI_RESUME_DEVICE_OK); + clear_toi_state(TOI_CAN_RESUME); + clear_result_state(TOI_ABORTED); + + if (!toiNumAllocators) { + if (!quiet) + printk(KERN_INFO "TuxOnIce: No storage allocators have " + "been registered. Hibernating will be " + "disabled.\n"); + goto cleanup; + } + + list_for_each(Allocator, &toiAllocators) { + thisAllocator = list_entry(Allocator, struct toi_module_ops, + type_list); + + /* + * Not sure why you'd want to disable an allocator, but + * we should honour the flag if we're providing it + */ + if (!thisAllocator->enabled) + continue; + + result = thisAllocator->parse_sig_location( + resume_file, (toiNumAllocators == 1), + quiet); + + switch (result) { + case -EINVAL: + /* For this allocator, but not a valid + * configuration. Error already printed. */ + goto cleanup; + + case 0: + /* For this allocator and valid. */ + toiActiveAllocator = thisAllocator; + + set_toi_state(TOI_RESUME_DEVICE_OK); + set_toi_state(TOI_CAN_RESUME); + returning = 1; + goto cleanup; + } + } + if (!quiet) + printk(KERN_INFO "TuxOnIce: No matching enabled allocator " + "found. Resuming disabled.\n"); +cleanup: + toi_deactivate_storage(0); + return returning; +} + +void attempt_to_parse_resume_device2(void) +{ + toi_prepare_usm(); + toi_attempt_to_parse_resume_device(0); + toi_cleanup_usm(); +} + +void save_restore_alt_param(int replace, int quiet) +{ + static char resume_param_save[255]; + static unsigned long toi_state_save; + + if (replace) { + toi_state_save = toi_state; + strcpy(resume_param_save, resume_file); + strcpy(resume_file, alt_resume_param); + } else { + strcpy(resume_file, resume_param_save); + toi_state = toi_state_save; + } + toi_attempt_to_parse_resume_device(quiet); +} + +void attempt_to_parse_alt_resume_param(void) +{ + int ok = 0; + + /* Temporarily set resume_param to the poweroff value */ + if (!strlen(alt_resume_param)) + return; + + printk(KERN_INFO "=== Trying Poweroff Resume2 ===\n"); + save_restore_alt_param(SAVE, NOQUIET); + if (test_toi_state(TOI_CAN_RESUME)) + ok = 1; + + printk(KERN_INFO "=== Done ===\n"); + save_restore_alt_param(RESTORE, QUIET); + + /* If not ok, clear the string */ + if (ok) + return; + + printk(KERN_INFO "Can't resume from that location; clearing " + "alt_resume_param.\n"); + alt_resume_param[0] = '\0'; +} + +/** + * noresume_reset_modules - reset data structures in case of non resuming + * + * When we read the start of an image, modules (and especially the + * active allocator) might need to reset data structures if we + * decide to remove the image rather than resuming from it. + **/ +static void noresume_reset_modules(void) +{ + struct toi_module_ops *this_filter; + + list_for_each_entry(this_filter, &toi_filters, type_list) + if (this_filter->noresume_reset) + this_filter->noresume_reset(); + + if (toiActiveAllocator && toiActiveAllocator->noresume_reset) + toiActiveAllocator->noresume_reset(); +} + +/** + * fill_toi_header - fill the hibernate header structure + * @struct toi_header: Header data structure to be filled. + **/ +static int fill_toi_header(struct toi_header *sh) +{ + int i, error; + + error = init_header((struct swsusp_info *) sh); + if (error) + return error; + + sh->pagedir = pagedir1; + sh->pageset_2_size = pagedir2.size; + sh->param0 = toi_result; + sh->param1 = toi_bkd.toi_action; + sh->param2 = toi_bkd.toi_debug_state; + sh->param3 = toi_bkd.toi_default_console_level; + sh->root_fs = current->fs->root.mnt->mnt_sb->s_dev; + for (i = 0; i < 4; i++) + sh->io_time[i/2][i%2] = toi_bkd.toi_io_time[i/2][i%2]; + sh->bkd = boot_kernel_data_buffer; + return 0; +} + +/** + * rw_init_modules - initialize modules + * @rw: Whether we are reading of writing an image. + * @which: Section of the image being processed. + * + * Iterate over modules, preparing the ones that will be used to read or write + * data. + **/ +static int rw_init_modules(int rw, int which) +{ + struct toi_module_ops *this_module; + /* Initialise page transformers */ + list_for_each_entry(this_module, &toi_filters, type_list) { + if (!this_module->enabled) + continue; + if (this_module->rw_init && this_module->rw_init(rw, which)) { + abort_hibernate(TOI_FAILED_MODULE_INIT, + "Failed to initialize the %s filter.", + this_module->name); + return 1; + } + } + + /* Initialise allocator */ + if (toiActiveAllocator->rw_init(rw, which)) { + abort_hibernate(TOI_FAILED_MODULE_INIT, + "Failed to initialise the allocator."); + return 1; + } + + /* Initialise other modules */ + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled || + this_module->type == FILTER_MODULE || + this_module->type == WRITER_MODULE) + continue; + if (this_module->rw_init && this_module->rw_init(rw, which)) { + set_abort_result(TOI_FAILED_MODULE_INIT); + printk(KERN_INFO "Setting aborted flag due to module " + "init failure.\n"); + return 1; + } + } + + return 0; +} + +/** + * rw_cleanup_modules - cleanup modules + * @rw: Whether we are reading of writing an image. + * + * Cleanup components after reading or writing a set of pages. + * Only the allocator may fail. + **/ +static int rw_cleanup_modules(int rw) +{ + struct toi_module_ops *this_module; + int result = 0; + + /* Cleanup other modules */ + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled || + this_module->type == FILTER_MODULE || + this_module->type == WRITER_MODULE) + continue; + if (this_module->rw_cleanup) + result |= this_module->rw_cleanup(rw); + } + + /* Flush data and cleanup */ + list_for_each_entry(this_module, &toi_filters, type_list) { + if (!this_module->enabled) + continue; + if (this_module->rw_cleanup) + result |= this_module->rw_cleanup(rw); + } + + result |= toiActiveAllocator->rw_cleanup(rw); + + return result; +} + +static struct page *copy_page_from_orig_page(struct page *orig_page, int is_high) +{ + int index, min, max; + struct page *high_page = NULL, + **my_last_high_page = raw_cpu_ptr(&last_high_page), + **my_last_sought = raw_cpu_ptr(&last_sought); + struct pbe *this, **my_last_low_page = raw_cpu_ptr(&last_low_page); + void *compare; + + if (is_high) { + if (*my_last_sought && *my_last_high_page && + *my_last_sought < orig_page) + high_page = *my_last_high_page; + else + high_page = (struct page *) restore_highmem_pblist; + this = (struct pbe *) kmap(high_page); + compare = orig_page; + } else { + if (*my_last_sought && *my_last_low_page && + *my_last_sought < orig_page) + this = *my_last_low_page; + else + this = restore_pblist; + compare = page_address(orig_page); + } + + *my_last_sought = orig_page; + + /* Locate page containing pbe */ + while (this[PBES_PER_PAGE - 1].next && + this[PBES_PER_PAGE - 1].orig_address < compare) { + if (is_high) { + struct page *next_high_page = (struct page *) + this[PBES_PER_PAGE - 1].next; + kunmap(high_page); + this = kmap(next_high_page); + high_page = next_high_page; + } else + this = this[PBES_PER_PAGE - 1].next; + } + + /* Do a binary search within the page */ + min = 0; + max = PBES_PER_PAGE; + index = PBES_PER_PAGE / 2; + while (max - min) { + if (!this[index].orig_address || + this[index].orig_address > compare) + max = index; + else if (this[index].orig_address == compare) { + if (is_high) { + struct page *page = this[index].address; + *my_last_high_page = high_page; + kunmap(high_page); + return page; + } + *my_last_low_page = this; + return virt_to_page(this[index].address); + } else + min = index; + index = ((max + min) / 2); + }; + + if (is_high) + kunmap(high_page); + + abort_hibernate(TOI_FAILED_IO, "Failed to get destination page for" + " orig page %p. This[min].orig_address=%p.\n", orig_page, + this[index].orig_address); + return NULL; +} + +/** + * write_next_page - write the next page in a pageset + * @data_pfn: The pfn where the next data to write is located. + * @my_io_index: The index of the page in the pageset. + * @write_pfn: The pfn number to write in the image (where the data belongs). + * + * Get the pfn of the next page to write, map the page if necessary and do the + * write. + **/ +static int write_next_page(unsigned long *data_pfn, int *my_io_index, + unsigned long *write_pfn) +{ + struct page *page; + char **my_checksum_locn = raw_cpu_ptr(&checksum_locn); + int result = 0, was_present; + + *data_pfn = memory_bm_next_pfn(io_map, 0); + + /* Another thread could have beaten us to it. */ + if (*data_pfn == BM_END_OF_MAP) { + if (atomic_read(&io_count)) { + printk(KERN_INFO "Ran out of pfns but io_count is " + "still %d.\n", atomic_read(&io_count)); + BUG(); + } + mutex_unlock(&io_mutex); + return -ENODATA; + } + + *my_io_index = io_finish_at - atomic_sub_return(1, &io_count); + + memory_bm_clear_bit(io_map, 0, *data_pfn); + page = pfn_to_page(*data_pfn); + + was_present = kernel_page_present(page); + if (!was_present) + kernel_map_pages(page, 1, 1); + + if (io_pageset == 1) + *write_pfn = memory_bm_next_pfn(pageset1_map, 0); + else { + *write_pfn = *data_pfn; + *my_checksum_locn = tuxonice_get_next_checksum(); + } + + TOI_TRACE_DEBUG(*data_pfn, "_PS%d_write %d", io_pageset, *my_io_index); + + mutex_unlock(&io_mutex); + + if (io_pageset == 2 && tuxonice_calc_checksum(page, *my_checksum_locn)) + return 1; + + result = first_filter->write_page(*write_pfn, TOI_PAGE, page, + PAGE_SIZE); + + if (!was_present) + kernel_map_pages(page, 1, 0); + + return result; +} + +/** + * read_next_page - read the next page in a pageset + * @my_io_index: The index of the page in the pageset. + * @write_pfn: The pfn in which the data belongs. + * + * Read a page of the image into our buffer. It can happen (here and in the + * write routine) that threads don't get run until after other CPUs have done + * all the work. This was the cause of the long standing issue with + * occasionally getting -ENODATA errors at the end of reading the image. We + * therefore need to check there's actually a page to read before trying to + * retrieve one. + **/ + +static int read_next_page(int *my_io_index, unsigned long *write_pfn, + struct page *buffer) +{ + unsigned int buf_size = PAGE_SIZE; + unsigned long left = atomic_read(&io_count); + + if (!left) + return -ENODATA; + + /* Start off assuming the page we read isn't resaved */ + *my_io_index = io_finish_at - atomic_sub_return(1, &io_count); + + mutex_unlock(&io_mutex); + + /* + * Are we aborting? If so, don't submit any more I/O as + * resetting the resume_attempted flag (from ui.c) will + * clear the bdev flags, making this thread oops. + */ + if (unlikely(test_toi_state(TOI_STOP_RESUME))) { + atomic_dec(&toi_io_workers); + if (!atomic_read(&toi_io_workers)) { + /* + * So we can be sure we'll have memory for + * marking that we haven't resumed. + */ + rw_cleanup_modules(READ); + set_toi_state(TOI_IO_STOPPED); + } + while (1) + schedule(); + } + + /* + * See toi_bio_read_page in tuxonice_bio.c: + * read the next page in the image. + */ + return first_filter->read_page(write_pfn, TOI_PAGE, buffer, &buf_size); +} + +static void use_read_page(unsigned long write_pfn, struct page *buffer) +{ + struct page *final_page = pfn_to_page(write_pfn), + *copy_page = final_page; + char *virt, *buffer_virt; + int was_present, cpu = smp_processor_id(); + unsigned long idx = 0; + + if (io_pageset == 1 && (!pageset1_copy_map || + !memory_bm_test_bit(pageset1_copy_map, cpu, write_pfn))) { + int is_high = PageHighMem(final_page); + copy_page = copy_page_from_orig_page(is_high ? (void *) write_pfn : final_page, is_high); + } + + if (!memory_bm_test_bit(io_map, cpu, write_pfn)) { + int test = !memory_bm_test_bit(io_map, cpu, write_pfn); + toi_message(TOI_IO, TOI_VERBOSE, 0, "Discard %ld (%d).", write_pfn, test); + mutex_lock(&io_mutex); + idx = atomic_add_return(1, &io_count); + mutex_unlock(&io_mutex); + return; + } + + virt = kmap(copy_page); + buffer_virt = kmap(buffer); + was_present = kernel_page_present(copy_page); + if (!was_present) + kernel_map_pages(copy_page, 1, 1); + memcpy(virt, buffer_virt, PAGE_SIZE); + if (!was_present) + kernel_map_pages(copy_page, 1, 0); + kunmap(copy_page); + kunmap(buffer); + memory_bm_clear_bit(io_map, cpu, write_pfn); + TOI_TRACE_DEBUG(write_pfn, "_PS%d_read", io_pageset); +} + +static unsigned long status_update(int writing, unsigned long done, + unsigned long ticks) +{ + int cs_index = writing ? 0 : 1; + unsigned long ticks_so_far = toi_bkd.toi_io_time[cs_index][1] + ticks; + unsigned long msec = jiffies_to_msecs(abs(ticks_so_far)); + unsigned long pgs_per_s, estimate = 0, pages_left; + + if (msec) { + pages_left = io_barmax - done; + pgs_per_s = 1000 * done / msec; + if (pgs_per_s) + estimate = DIV_ROUND_UP(pages_left, pgs_per_s); + } + + if (estimate && ticks > HZ / 2) + return toi_update_status(done, io_barmax, + " %d/%d MB (%lu sec left)", + MB(done+1), MB(io_barmax), estimate); + + return toi_update_status(done, io_barmax, " %d/%d MB", + MB(done+1), MB(io_barmax)); +} + +/** + * worker_rw_loop - main loop to read/write pages + * + * The main I/O loop for reading or writing pages. The io_map bitmap is used to + * track the pages to read/write. + * If we are reading, the pages are loaded to their final (mapped) pfn. + * Data is non zero iff this is a thread started via start_other_threads. + * In that case, we stay in here until told to quit. + **/ +static int worker_rw_loop(void *data) +{ + unsigned long data_pfn, write_pfn, next_jiffies = jiffies + HZ / 4, + jif_index = 1, start_time = jiffies, thread_num; + int result = 0, my_io_index = 0, last_worker; + struct page *buffer = toi_alloc_page(28, TOI_ATOMIC_GFP); + cpumask_var_t orig_mask; + + if (!alloc_cpumask_var(&orig_mask, GFP_KERNEL)) { + printk(KERN_EMERG "Failed to allocate cpumask for TuxOnIce I/O thread %ld.\n", (unsigned long) data); + result = -ENOMEM; + goto out; + } + + cpumask_copy(orig_mask, tsk_cpus_allowed(current)); + + current->flags |= PF_NOFREEZE; + +top: + mutex_lock(&io_mutex); + thread_num = atomic_read(&toi_io_workers); + + cpumask_copy(tsk_cpus_allowed(current), orig_mask); + schedule(); + + atomic_inc(&toi_io_workers); + + while (atomic_read(&io_count) >= atomic_read(&toi_io_workers) && + !(io_write && test_result_state(TOI_ABORTED)) && + toi_worker_command == TOI_IO_WORKER_RUN) { + if (!thread_num && jiffies > next_jiffies) { + next_jiffies += HZ / 4; + if (toiActiveAllocator->update_throughput_throttle) + toiActiveAllocator->update_throughput_throttle( + jif_index); + jif_index++; + } + + /* + * What page to use? If reading, don't know yet which page's + * data will be read, so always use the buffer. If writing, + * use the copy (Pageset1) or original page (Pageset2), but + * always write the pfn of the original page. + */ + if (io_write) + result = write_next_page(&data_pfn, &my_io_index, + &write_pfn); + else /* Reading */ + result = read_next_page(&my_io_index, &write_pfn, + buffer); + + if (result) { + mutex_lock(&io_mutex); + /* Nothing to do? */ + if (result == -ENODATA) { + toi_message(TOI_IO, TOI_VERBOSE, 0, + "Thread %d has no more work.", + smp_processor_id()); + break; + } + + io_result = result; + + if (io_write) { + printk(KERN_INFO "Write chunk returned %d.\n", + result); + abort_hibernate(TOI_FAILED_IO, + "Failed to write a chunk of the " + "image."); + break; + } + + if (io_pageset == 1) { + printk(KERN_ERR "\nBreaking out of I/O loop " + "because of result code %d.\n", result); + break; + } + panic("Read chunk returned (%d)", result); + } + + /* + * Discard reads of resaved pages while reading ps2 + * and unwanted pages while rereading ps2 when aborting. + */ + if (!io_write) { + if (!PageResave(pfn_to_page(write_pfn))) + use_read_page(write_pfn, buffer); + else { + mutex_lock(&io_mutex); + toi_message(TOI_IO, TOI_VERBOSE, 0, + "Resaved %ld.", write_pfn); + atomic_inc(&io_count); + mutex_unlock(&io_mutex); + } + } + + if (!thread_num) { + if(my_io_index + io_base > io_nextupdate) + io_nextupdate = status_update(io_write, + my_io_index + io_base, + jiffies - start_time); + + if (my_io_index > io_pc) { + printk(KERN_CONT "...%d%%", 20 * io_pc_step); + io_pc_step++; + io_pc = io_finish_at * io_pc_step / 5; + } + } + + toi_cond_pause(0, NULL); + + /* + * Subtle: If there's less I/O still to be done than threads + * running, quit. This stops us doing I/O beyond the end of + * the image when reading. + * + * Possible race condition. Two threads could do the test at + * the same time; one should exit and one should continue. + * Therefore we take the mutex before comparing and exiting. + */ + + mutex_lock(&io_mutex); + } + + last_worker = atomic_dec_and_test(&toi_io_workers); + toi_message(TOI_IO, TOI_VERBOSE, 0, "%d workers left.", atomic_read(&toi_io_workers)); + mutex_unlock(&io_mutex); + + if ((unsigned long) data && toi_worker_command != TOI_IO_WORKER_EXIT) { + /* Were we the last thread and we're using a flusher thread? */ + if (last_worker && using_flusher) { + toiActiveAllocator->finish_all_io(); + } + /* First, if we're doing I/O, wait for it to finish */ + wait_event(toi_worker_wait_queue, toi_worker_command != TOI_IO_WORKER_RUN); + /* Then wait to be told what to do next */ + wait_event(toi_worker_wait_queue, toi_worker_command != TOI_IO_WORKER_STOP); + if (toi_worker_command == TOI_IO_WORKER_RUN) + goto top; + } + + if (thread_num) + atomic_dec(&toi_num_other_threads); + +out: + toi_message(TOI_IO, TOI_LOW, 0, "Thread %d exiting.", thread_num); + toi__free_page(28, buffer); + free_cpumask_var(orig_mask); + + return result; +} + +int toi_start_other_threads(void) +{ + int cpu; + struct task_struct *p; + int to_start = (toi_max_workers ? toi_max_workers : num_online_cpus()) - 1; + unsigned long num_started = 0; + + if (test_action_state(TOI_NO_MULTITHREADED_IO)) + return 0; + + toi_worker_command = TOI_IO_WORKER_STOP; + + for_each_online_cpu(cpu) { + if (num_started == to_start) + break; + + if (cpu == smp_processor_id()) + continue; + + p = kthread_create_on_node(worker_rw_loop, (void *) num_started + 1, + cpu_to_node(cpu), "ktoi_io/%d", cpu); + if (IS_ERR(p)) { + printk(KERN_ERR "ktoi_io for %i failed\n", cpu); + continue; + } + kthread_bind(p, cpu); + p->flags |= PF_MEMALLOC; + wake_up_process(p); + num_started++; + atomic_inc(&toi_num_other_threads); + } + + toi_message(TOI_IO, TOI_LOW, 0, "Started %d threads.", num_started); + return num_started; +} + +void toi_stop_other_threads(void) +{ + toi_message(TOI_IO, TOI_LOW, 0, "Stopping other threads."); + toi_worker_command = TOI_IO_WORKER_EXIT; + wake_up(&toi_worker_wait_queue); +} + +/** + * do_rw_loop - main highlevel function for reading or writing pages + * + * Create the io_map bitmap and call worker_rw_loop to perform I/O operations. + **/ +static int do_rw_loop(int write, int finish_at, struct memory_bitmap *pageflags, + int base, int barmax, int pageset) +{ + int index = 0, cpu, result = 0, workers_started; + unsigned long pfn, next; + + first_filter = toi_get_next_filter(NULL); + + if (!finish_at) + return 0; + + io_write = write; + io_finish_at = finish_at; + io_base = base; + io_barmax = barmax; + io_pageset = pageset; + io_index = 0; + io_pc = io_finish_at / 5; + io_pc_step = 1; + io_result = 0; + io_nextupdate = base + 1; + toi_bio_queue_flusher_should_finish = 0; + + for_each_online_cpu(cpu) { + per_cpu(last_sought, cpu) = NULL; + per_cpu(last_low_page, cpu) = NULL; + per_cpu(last_high_page, cpu) = NULL; + } + + /* Ensure all bits clear */ + memory_bm_clear(io_map); + + memory_bm_position_reset(io_map); + next = memory_bm_next_pfn(io_map, 0); + + BUG_ON(next != BM_END_OF_MAP); + + /* Set the bits for the pages to write */ + memory_bm_position_reset(pageflags); + + pfn = memory_bm_next_pfn(pageflags, 0); + toi_trace_index++; + + while (pfn != BM_END_OF_MAP && index < finish_at) { + TOI_TRACE_DEBUG(pfn, "_io_pageset_%d (%d/%d)", pageset, index + 1, finish_at); + memory_bm_set_bit(io_map, 0, pfn); + pfn = memory_bm_next_pfn(pageflags, 0); + index++; + } + + BUG_ON(next != BM_END_OF_MAP || index < finish_at); + + memory_bm_position_reset(io_map); + toi_trace_index++; + + atomic_set(&io_count, finish_at); + + memory_bm_position_reset(pageset1_map); + + mutex_lock(&io_mutex); + + clear_toi_state(TOI_IO_STOPPED); + + using_flusher = (atomic_read(&toi_num_other_threads) && + toiActiveAllocator->io_flusher && + !test_action_state(TOI_NO_FLUSHER_THREAD)); + + workers_started = atomic_read(&toi_num_other_threads); + + memory_bm_position_reset(io_map); + memory_bm_position_reset(pageset1_copy_map); + + toi_worker_command = TOI_IO_WORKER_RUN; + wake_up(&toi_worker_wait_queue); + + mutex_unlock(&io_mutex); + + if (using_flusher) + result = toiActiveAllocator->io_flusher(write); + else + worker_rw_loop(NULL); + + while (atomic_read(&toi_io_workers)) + schedule(); + + printk(KERN_CONT "\n"); + + toi_worker_command = TOI_IO_WORKER_STOP; + wake_up(&toi_worker_wait_queue); + + if (unlikely(test_toi_state(TOI_STOP_RESUME))) { + if (!atomic_read(&toi_io_workers)) { + rw_cleanup_modules(READ); + set_toi_state(TOI_IO_STOPPED); + } + while (1) + schedule(); + } + set_toi_state(TOI_IO_STOPPED); + + if (!io_result && !result && !test_result_state(TOI_ABORTED)) { + unsigned long next; + + toi_update_status(io_base + io_finish_at, io_barmax, + " %d/%d MB ", + MB(io_base + io_finish_at), MB(io_barmax)); + + memory_bm_position_reset(io_map); + next = memory_bm_next_pfn(io_map, 0); + if (next != BM_END_OF_MAP) { + printk(KERN_INFO "Finished I/O loop but still work to " + "do?\nFinish at = %d. io_count = %d.\n", + finish_at, atomic_read(&io_count)); + printk(KERN_INFO "I/O bitmap still records work to do." + "%ld.\n", next); + BUG(); + do { + cpu_relax(); + } while (0); + } + } + + return io_result ? io_result : result; +} + +/** + * write_pageset - write a pageset to disk. + * @pagedir: Which pagedir to write. + * + * Returns: + * Zero on success or -1 on failure. + **/ +int write_pageset(struct pagedir *pagedir) +{ + int finish_at, base = 0; + int barmax = pagedir1.size + pagedir2.size; + long error = 0; + struct memory_bitmap *pageflags; + unsigned long start_time, end_time; + + /* + * Even if there is nothing to read or write, the allocator + * may need the init/cleanup for it's housekeeping. (eg: + * Pageset1 may start where pageset2 ends when writing). + */ + finish_at = pagedir->size; + + if (pagedir->id == 1) { + toi_prepare_status(DONT_CLEAR_BAR, + "Writing kernel & process data..."); + base = pagedir2.size; + if (test_action_state(TOI_TEST_FILTER_SPEED) || + test_action_state(TOI_TEST_BIO)) + pageflags = pageset1_map; + else + pageflags = pageset1_copy_map; + } else { + toi_prepare_status(DONT_CLEAR_BAR, "Writing caches..."); + pageflags = pageset2_map; + } + + start_time = jiffies; + + if (rw_init_modules(WRITE, pagedir->id)) { + abort_hibernate(TOI_FAILED_MODULE_INIT, + "Failed to initialise modules for writing."); + error = 1; + } + + if (!error) + error = do_rw_loop(WRITE, finish_at, pageflags, base, barmax, + pagedir->id); + + if (rw_cleanup_modules(WRITE) && !error) { + abort_hibernate(TOI_FAILED_MODULE_CLEANUP, + "Failed to cleanup after writing."); + error = 1; + } + + end_time = jiffies; + + if ((end_time - start_time) && (!test_result_state(TOI_ABORTED))) { + toi_bkd.toi_io_time[0][0] += finish_at, + toi_bkd.toi_io_time[0][1] += (end_time - start_time); + } + + return error; +} + +/** + * read_pageset - highlevel function to read a pageset from disk + * @pagedir: pageset to read + * @overwrittenpagesonly: Whether to read the whole pageset or + * only part of it. + * + * Returns: + * Zero on success or -1 on failure. + **/ +static int read_pageset(struct pagedir *pagedir, int overwrittenpagesonly) +{ + int result = 0, base = 0; + int finish_at = pagedir->size; + int barmax = pagedir1.size + pagedir2.size; + struct memory_bitmap *pageflags; + unsigned long start_time, end_time; + + if (pagedir->id == 1) { + toi_prepare_status(DONT_CLEAR_BAR, + "Reading kernel & process data..."); + pageflags = pageset1_map; + } else { + toi_prepare_status(DONT_CLEAR_BAR, "Reading caches..."); + if (overwrittenpagesonly) { + barmax = min(pagedir1.size, pagedir2.size); + finish_at = min(pagedir1.size, pagedir2.size); + } else + base = pagedir1.size; + pageflags = pageset2_map; + } + + start_time = jiffies; + + if (rw_init_modules(READ, pagedir->id)) { + toiActiveAllocator->remove_image(); + result = 1; + } else + result = do_rw_loop(READ, finish_at, pageflags, base, barmax, + pagedir->id); + + if (rw_cleanup_modules(READ) && !result) { + abort_hibernate(TOI_FAILED_MODULE_CLEANUP, + "Failed to cleanup after reading."); + result = 1; + } + + /* Statistics */ + end_time = jiffies; + + if ((end_time - start_time) && (!test_result_state(TOI_ABORTED))) { + toi_bkd.toi_io_time[1][0] += finish_at, + toi_bkd.toi_io_time[1][1] += (end_time - start_time); + } + + return result; +} + +/** + * write_module_configs - store the modules configuration + * + * The configuration for each module is stored in the image header. + * Returns: Int + * Zero on success, Error value otherwise. + **/ +static int write_module_configs(void) +{ + struct toi_module_ops *this_module; + char *buffer = (char *) toi_get_zeroed_page(22, TOI_ATOMIC_GFP); + int len, index = 1; + struct toi_module_header toi_module_header; + + if (!buffer) { + printk(KERN_INFO "Failed to allocate a buffer for saving " + "module configuration info.\n"); + return -ENOMEM; + } + + /* + * We have to know which data goes with which module, so we at + * least write a length of zero for a module. Note that we are + * also assuming every module's config data takes <= PAGE_SIZE. + */ + + /* For each module (in registration order) */ + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled || !this_module->storage_needed || + (this_module->type == WRITER_MODULE && + toiActiveAllocator != this_module)) + continue; + + /* Get the data from the module */ + len = 0; + if (this_module->save_config_info) + len = this_module->save_config_info(buffer); + + /* Save the details of the module */ + toi_module_header.enabled = this_module->enabled; + toi_module_header.type = this_module->type; + toi_module_header.index = index++; + strncpy(toi_module_header.name, this_module->name, + sizeof(toi_module_header.name)); + toiActiveAllocator->rw_header_chunk(WRITE, + this_module, + (char *) &toi_module_header, + sizeof(toi_module_header)); + + /* Save the size of the data and any data returned */ + toiActiveAllocator->rw_header_chunk(WRITE, + this_module, + (char *) &len, sizeof(int)); + if (len) + toiActiveAllocator->rw_header_chunk( + WRITE, this_module, buffer, len); + } + + /* Write a blank header to terminate the list */ + toi_module_header.name[0] = '\0'; + toiActiveAllocator->rw_header_chunk(WRITE, NULL, + (char *) &toi_module_header, sizeof(toi_module_header)); + + toi_free_page(22, (unsigned long) buffer); + return 0; +} + +/** + * read_one_module_config - read and configure one module + * + * Read the configuration for one module, and configure the module + * to match if it is loaded. + * + * Returns: Int + * Zero on success, Error value otherwise. + **/ +static int read_one_module_config(struct toi_module_header *header) +{ + struct toi_module_ops *this_module; + int result, len; + char *buffer; + + /* Find the module */ + this_module = toi_find_module_given_name(header->name); + + if (!this_module) { + if (header->enabled) { + toi_early_boot_message(1, TOI_CONTINUE_REQ, + "It looks like we need module %s for reading " + "the image but it hasn't been registered.\n", + header->name); + if (!(test_toi_state(TOI_CONTINUE_REQ))) + return -EINVAL; + } else + printk(KERN_INFO "Module %s configuration data found, " + "but the module hasn't registered. Looks like " + "it was disabled, so we're ignoring its data.", + header->name); + } + + /* Get the length of the data (if any) */ + result = toiActiveAllocator->rw_header_chunk(READ, NULL, (char *) &len, + sizeof(int)); + if (result) { + printk(KERN_ERR "Failed to read the length of the module %s's" + " configuration data.\n", + header->name); + return -EINVAL; + } + + /* Read any data and pass to the module (if we found one) */ + if (!len) + return 0; + + buffer = (char *) toi_get_zeroed_page(23, TOI_ATOMIC_GFP); + + if (!buffer) { + printk(KERN_ERR "Failed to allocate a buffer for reloading " + "module configuration info.\n"); + return -ENOMEM; + } + + toiActiveAllocator->rw_header_chunk(READ, NULL, buffer, len); + + if (!this_module) + goto out; + + if (!this_module->save_config_info) + printk(KERN_ERR "Huh? Module %s appears to have a " + "save_config_info, but not a load_config_info " + "function!\n", this_module->name); + else + this_module->load_config_info(buffer, len); + + /* + * Now move this module to the tail of its lists. This will put it in + * order. Any new modules will end up at the top of the lists. They + * should have been set to disabled when loaded (people will + * normally not edit an initrd to load a new module and then hibernate + * without using it!). + */ + + toi_move_module_tail(this_module); + + this_module->enabled = header->enabled; + +out: + toi_free_page(23, (unsigned long) buffer); + return 0; +} + +/** + * read_module_configs - reload module configurations from the image header. + * + * Returns: Int + * Zero on success or an error code. + **/ +static int read_module_configs(void) +{ + int result = 0; + struct toi_module_header toi_module_header; + struct toi_module_ops *this_module; + + /* All modules are initially disabled. That way, if we have a module + * loaded now that wasn't loaded when we hibernated, it won't be used + * in trying to read the data. + */ + list_for_each_entry(this_module, &toi_modules, module_list) + this_module->enabled = 0; + + /* Get the first module header */ + result = toiActiveAllocator->rw_header_chunk(READ, NULL, + (char *) &toi_module_header, + sizeof(toi_module_header)); + if (result) { + printk(KERN_ERR "Failed to read the next module header.\n"); + return -EINVAL; + } + + /* For each module (in registration order) */ + while (toi_module_header.name[0]) { + result = read_one_module_config(&toi_module_header); + + if (result) + return -EINVAL; + + /* Get the next module header */ + result = toiActiveAllocator->rw_header_chunk(READ, NULL, + (char *) &toi_module_header, + sizeof(toi_module_header)); + + if (result) { + printk(KERN_ERR "Failed to read the next module " + "header.\n"); + return -EINVAL; + } + } + + return 0; +} + +static inline int save_fs_info(struct fs_info *fs, struct block_device *bdev) +{ + return (!fs || IS_ERR(fs) || !fs->last_mount_size) ? 0 : 1; +} + +int fs_info_space_needed(void) +{ + const struct super_block *sb; + int result = sizeof(int); + + list_for_each_entry(sb, &super_blocks, s_list) { + struct fs_info *fs; + + if (!sb->s_bdev) + continue; + + fs = fs_info_from_block_dev(sb->s_bdev); + if (save_fs_info(fs, sb->s_bdev)) + result += 16 + sizeof(dev_t) + sizeof(int) + + fs->last_mount_size; + free_fs_info(fs); + } + return result; +} + +static int fs_info_num_to_save(void) +{ + const struct super_block *sb; + int to_save = 0; + + list_for_each_entry(sb, &super_blocks, s_list) { + struct fs_info *fs; + + if (!sb->s_bdev) + continue; + + fs = fs_info_from_block_dev(sb->s_bdev); + if (save_fs_info(fs, sb->s_bdev)) + to_save++; + free_fs_info(fs); + } + + return to_save; +} + +static int fs_info_save(void) +{ + const struct super_block *sb; + int to_save = fs_info_num_to_save(); + + if (toiActiveAllocator->rw_header_chunk(WRITE, NULL, (char *) &to_save, + sizeof(int))) { + abort_hibernate(TOI_FAILED_IO, "Failed to write num fs_info" + " to save."); + return -EIO; + } + + list_for_each_entry(sb, &super_blocks, s_list) { + struct fs_info *fs; + + if (!sb->s_bdev) + continue; + + fs = fs_info_from_block_dev(sb->s_bdev); + if (save_fs_info(fs, sb->s_bdev)) { + if (toiActiveAllocator->rw_header_chunk(WRITE, NULL, + &fs->uuid[0], 16)) { + abort_hibernate(TOI_FAILED_IO, "Failed to " + "write uuid."); + return -EIO; + } + if (toiActiveAllocator->rw_header_chunk(WRITE, NULL, + (char *) &fs->dev_t, sizeof(dev_t))) { + abort_hibernate(TOI_FAILED_IO, "Failed to " + "write dev_t."); + return -EIO; + } + if (toiActiveAllocator->rw_header_chunk(WRITE, NULL, + (char *) &fs->last_mount_size, sizeof(int))) { + abort_hibernate(TOI_FAILED_IO, "Failed to " + "write last mount length."); + return -EIO; + } + if (toiActiveAllocator->rw_header_chunk(WRITE, NULL, + fs->last_mount, fs->last_mount_size)) { + abort_hibernate(TOI_FAILED_IO, "Failed to " + "write uuid."); + return -EIO; + } + } + free_fs_info(fs); + } + return 0; +} + +static int fs_info_load_and_check_one(void) +{ + char uuid[16], *last_mount; + int result = 0, ln; + dev_t dev_t; + struct block_device *dev; + struct fs_info *fs_info, seek; + + if (toiActiveAllocator->rw_header_chunk(READ, NULL, uuid, 16)) { + abort_hibernate(TOI_FAILED_IO, "Failed to read uuid."); + return -EIO; + } + + read_if_version(3, dev_t, "uuid dev_t field", return -EIO); + + if (toiActiveAllocator->rw_header_chunk(READ, NULL, (char *) &ln, + sizeof(int))) { + abort_hibernate(TOI_FAILED_IO, + "Failed to read last mount size."); + return -EIO; + } + + last_mount = kzalloc(ln, GFP_KERNEL); + + if (!last_mount) + return -ENOMEM; + + if (toiActiveAllocator->rw_header_chunk(READ, NULL, last_mount, ln)) { + abort_hibernate(TOI_FAILED_IO, + "Failed to read last mount timestamp."); + result = -EIO; + goto out_lmt; + } + + strncpy((char *) &seek.uuid, uuid, 16); + seek.dev_t = dev_t; + seek.last_mount_size = ln; + seek.last_mount = last_mount; + dev_t = blk_lookup_fs_info(&seek); + if (!dev_t) + goto out_lmt; + + dev = toi_open_by_devnum(dev_t); + + fs_info = fs_info_from_block_dev(dev); + if (fs_info && !IS_ERR(fs_info)) { + if (ln != fs_info->last_mount_size) { + printk(KERN_EMERG "Found matching uuid but last mount " + "time lengths differ?! " + "(%d vs %d).\n", ln, + fs_info->last_mount_size); + result = -EINVAL; + } else { + char buf[BDEVNAME_SIZE]; + result = !!memcmp(fs_info->last_mount, last_mount, ln); + if (result) + printk(KERN_EMERG "Last mount time for %s has " + "changed!\n", bdevname(dev, buf)); + } + } + toi_close_bdev(dev); + free_fs_info(fs_info); +out_lmt: + kfree(last_mount); + return result; +} + +static int fs_info_load_and_check(void) +{ + int to_do, result = 0; + + if (toiActiveAllocator->rw_header_chunk(READ, NULL, (char *) &to_do, + sizeof(int))) { + abort_hibernate(TOI_FAILED_IO, "Failed to read num fs_info " + "to load."); + return -EIO; + } + + while(to_do--) + result |= fs_info_load_and_check_one(); + + return result; +} + +/** + * write_image_header - write the image header after write the image proper + * + * Returns: Int + * Zero on success, error value otherwise. + **/ +int write_image_header(void) +{ + int ret; + int total = pagedir1.size + pagedir2.size+2; + char *header_buffer = NULL; + + /* Now prepare to write the header */ + ret = toiActiveAllocator->write_header_init(); + if (ret) { + abort_hibernate(TOI_FAILED_MODULE_INIT, + "Active allocator's write_header_init" + " function failed."); + goto write_image_header_abort; + } + + /* Get a buffer */ + header_buffer = (char *) toi_get_zeroed_page(24, TOI_ATOMIC_GFP); + if (!header_buffer) { + abort_hibernate(TOI_OUT_OF_MEMORY, + "Out of memory when trying to get page for header!"); + goto write_image_header_abort; + } + + /* Write hibernate header */ + if (fill_toi_header((struct toi_header *) header_buffer)) { + abort_hibernate(TOI_OUT_OF_MEMORY, + "Failure to fill header information!"); + goto write_image_header_abort; + } + + if (toiActiveAllocator->rw_header_chunk(WRITE, NULL, + header_buffer, sizeof(struct toi_header))) { + abort_hibernate(TOI_OUT_OF_MEMORY, + "Failure to write header info."); + goto write_image_header_abort; + } + + if (toiActiveAllocator->rw_header_chunk(WRITE, NULL, + (char *) &toi_max_workers, sizeof(toi_max_workers))) { + abort_hibernate(TOI_OUT_OF_MEMORY, + "Failure to number of workers to use."); + goto write_image_header_abort; + } + + /* Write filesystem info */ + if (fs_info_save()) + goto write_image_header_abort; + + /* Write module configurations */ + ret = write_module_configs(); + if (ret) { + abort_hibernate(TOI_FAILED_IO, + "Failed to write module configs."); + goto write_image_header_abort; + } + + if (memory_bm_write(pageset1_map, + toiActiveAllocator->rw_header_chunk)) { + abort_hibernate(TOI_FAILED_IO, + "Failed to write bitmaps."); + goto write_image_header_abort; + } + + /* Flush data and let allocator cleanup */ + if (toiActiveAllocator->write_header_cleanup()) { + abort_hibernate(TOI_FAILED_IO, + "Failed to cleanup writing header."); + goto write_image_header_abort_no_cleanup; + } + + if (test_result_state(TOI_ABORTED)) + goto write_image_header_abort_no_cleanup; + + toi_update_status(total, total, NULL); + +out: + if (header_buffer) + toi_free_page(24, (unsigned long) header_buffer); + return ret; + +write_image_header_abort: + toiActiveAllocator->write_header_cleanup(); +write_image_header_abort_no_cleanup: + ret = -1; + goto out; +} + +/** + * sanity_check - check the header + * @sh: the header which was saved at hibernate time. + * + * Perform a few checks, seeking to ensure that the kernel being + * booted matches the one hibernated. They need to match so we can + * be _sure_ things will work. It is not absolutely impossible for + * resuming from a different kernel to work, just not assured. + **/ +static char *sanity_check(struct toi_header *sh) +{ + char *reason = check_image_kernel((struct swsusp_info *) sh); + + if (reason) + return reason; + + if (!test_action_state(TOI_IGNORE_ROOTFS)) { + const struct super_block *sb; + list_for_each_entry(sb, &super_blocks, s_list) { + if ((!(sb->s_flags & MS_RDONLY)) && + (sb->s_type->fs_flags & FS_REQUIRES_DEV)) + return "Device backed fs has been mounted " + "rw prior to resume or initrd/ramfs " + "is mounted rw."; + } + } + + return NULL; +} + +static DECLARE_WAIT_QUEUE_HEAD(freeze_wait); + +#define FREEZE_IN_PROGRESS (~0) + +static int freeze_result; + +static void do_freeze(struct work_struct *dummy) +{ + freeze_result = freeze_processes(); + wake_up(&freeze_wait); + trap_non_toi_io = 1; +} + +static DECLARE_WORK(freeze_work, do_freeze); + +/** + * __read_pageset1 - test for the existence of an image and attempt to load it + * + * Returns: Int + * Zero if image found and pageset1 successfully loaded. + * Error if no image found or loaded. + **/ +static int __read_pageset1(void) +{ + int i, result = 0; + char *header_buffer = (char *) toi_get_zeroed_page(25, TOI_ATOMIC_GFP), + *sanity_error = NULL; + struct toi_header *toi_header; + + if (!header_buffer) { + printk(KERN_INFO "Unable to allocate a page for reading the " + "signature.\n"); + return -ENOMEM; + } + + /* Check for an image */ + result = toiActiveAllocator->image_exists(1); + if (result == 3) { + result = -ENODATA; + toi_early_boot_message(1, 0, "The signature from an older " + "version of TuxOnIce has been detected."); + goto out_remove_image; + } + + if (result != 1) { + result = -ENODATA; + noresume_reset_modules(); + printk(KERN_INFO "TuxOnIce: No image found.\n"); + goto out; + } + + /* + * Prepare the active allocator for reading the image header. The + * activate allocator might read its own configuration. + * + * NB: This call may never return because there might be a signature + * for a different image such that we warn the user and they choose + * to reboot. (If the device ids look erroneous (2.4 vs 2.6) or the + * location of the image might be unavailable if it was stored on a + * network connection). + */ + + result = toiActiveAllocator->read_header_init(); + if (result) { + printk(KERN_INFO "TuxOnIce: Failed to initialise, reading the " + "image header.\n"); + goto out_remove_image; + } + + /* Check for noresume command line option */ + if (test_toi_state(TOI_NORESUME_SPECIFIED)) { + printk(KERN_INFO "TuxOnIce: Noresume on command line. Removed " + "image.\n"); + goto out_remove_image; + } + + /* Check whether we've resumed before */ + if (test_toi_state(TOI_RESUMED_BEFORE)) { + toi_early_boot_message(1, 0, NULL); + if (!(test_toi_state(TOI_CONTINUE_REQ))) { + printk(KERN_INFO "TuxOnIce: Tried to resume before: " + "Invalidated image.\n"); + goto out_remove_image; + } + } + + clear_toi_state(TOI_CONTINUE_REQ); + + toi_image_header_version = toiActiveAllocator->get_header_version(); + + if (unlikely(toi_image_header_version > TOI_HEADER_VERSION)) { + toi_early_boot_message(1, 0, image_version_error); + if (!(test_toi_state(TOI_CONTINUE_REQ))) { + printk(KERN_INFO "TuxOnIce: Header version too new: " + "Invalidated image.\n"); + goto out_remove_image; + } + } + + /* Read hibernate header */ + result = toiActiveAllocator->rw_header_chunk(READ, NULL, + header_buffer, sizeof(struct toi_header)); + if (result < 0) { + printk(KERN_ERR "TuxOnIce: Failed to read the image " + "signature.\n"); + goto out_remove_image; + } + + toi_header = (struct toi_header *) header_buffer; + + /* + * NB: This call may also result in a reboot rather than returning. + */ + + sanity_error = sanity_check(toi_header); + if (sanity_error) { + toi_early_boot_message(1, TOI_CONTINUE_REQ, + sanity_error); + printk(KERN_INFO "TuxOnIce: Sanity check failed.\n"); + goto out_remove_image; + } + + /* + * We have an image and it looks like it will load okay. + * + * Get metadata from header. Don't override commandline parameters. + * + * We don't need to save the image size limit because it's not used + * during resume and will be restored with the image anyway. + */ + + memcpy((char *) &pagedir1, + (char *) &toi_header->pagedir, sizeof(pagedir1)); + toi_result = toi_header->param0; + if (!toi_bkd.toi_debug_state) { + toi_bkd.toi_action = + (toi_header->param1 & ~toi_bootflags_mask) | + (toi_bkd.toi_action & toi_bootflags_mask); + toi_bkd.toi_debug_state = toi_header->param2; + toi_bkd.toi_default_console_level = toi_header->param3; + } + clear_toi_state(TOI_IGNORE_LOGLEVEL); + pagedir2.size = toi_header->pageset_2_size; + for (i = 0; i < 4; i++) + toi_bkd.toi_io_time[i/2][i%2] = + toi_header->io_time[i/2][i%2]; + + set_toi_state(TOI_BOOT_KERNEL); + boot_kernel_data_buffer = toi_header->bkd; + + read_if_version(1, toi_max_workers, "TuxOnIce max workers", + goto out_remove_image); + + /* Read filesystem info */ + if (fs_info_load_and_check()) { + printk(KERN_EMERG "TuxOnIce: File system mount time checks " + "failed. Refusing to corrupt your filesystems!\n"); + goto out_remove_image; + } + + /* Read module configurations */ + result = read_module_configs(); + if (result) { + pagedir1.size = 0; + pagedir2.size = 0; + printk(KERN_INFO "TuxOnIce: Failed to read TuxOnIce module " + "configurations.\n"); + clear_action_state(TOI_KEEP_IMAGE); + goto out_remove_image; + } + + toi_prepare_console(); + + set_toi_state(TOI_NOW_RESUMING); + + result = pm_notifier_call_chain(PM_RESTORE_PREPARE); + if (result) + goto out_notifier_call_chain;; + + if (usermodehelper_disable()) + goto out_enable_usermodehelper; + + current->flags |= PF_NOFREEZE; + freeze_result = FREEZE_IN_PROGRESS; + + schedule_work_on(cpumask_first(cpu_online_mask), &freeze_work); + + toi_cond_pause(1, "About to read original pageset1 locations."); + + /* + * See _toi_rw_header_chunk in tuxonice_bio.c: + * Initialize pageset1_map by reading the map from the image. + */ + if (memory_bm_read(pageset1_map, toiActiveAllocator->rw_header_chunk)) + goto out_thaw; + + /* + * See toi_rw_cleanup in tuxonice_bio.c: + * Clean up after reading the header. + */ + result = toiActiveAllocator->read_header_cleanup(); + if (result) { + printk(KERN_ERR "TuxOnIce: Failed to cleanup after reading the " + "image header.\n"); + goto out_thaw; + } + + toi_cond_pause(1, "About to read pagedir."); + + /* + * Get the addresses of pages into which we will load the kernel to + * be copied back and check if they conflict with the ones we are using. + */ + if (toi_get_pageset1_load_addresses()) { + printk(KERN_INFO "TuxOnIce: Failed to get load addresses for " + "pageset1.\n"); + goto out_thaw; + } + + /* Read the original kernel back */ + toi_cond_pause(1, "About to read pageset 1."); + + /* Given the pagemap, read back the data from disk */ + if (read_pageset(&pagedir1, 0)) { + toi_prepare_status(DONT_CLEAR_BAR, "Failed to read pageset 1."); + result = -EIO; + goto out_thaw; + } + + toi_cond_pause(1, "About to restore original kernel."); + result = 0; + + if (!toi_keeping_image && + toiActiveAllocator->mark_resume_attempted) + toiActiveAllocator->mark_resume_attempted(1); + + wait_event(freeze_wait, freeze_result != FREEZE_IN_PROGRESS); +out: + current->flags &= ~PF_NOFREEZE; + toi_free_page(25, (unsigned long) header_buffer); + return result; + +out_thaw: + wait_event(freeze_wait, freeze_result != FREEZE_IN_PROGRESS); + trap_non_toi_io = 0; + thaw_processes(); +out_enable_usermodehelper: + usermodehelper_enable(); +out_notifier_call_chain: + pm_notifier_call_chain(PM_POST_RESTORE); + toi_cleanup_console(); +out_remove_image: + result = -EINVAL; + if (!toi_keeping_image) + toiActiveAllocator->remove_image(); + toiActiveAllocator->read_header_cleanup(); + noresume_reset_modules(); + goto out; +} + +/** + * read_pageset1 - highlevel function to read the saved pages + * + * Attempt to read the header and pageset1 of a hibernate image. + * Handle the outcome, complaining where appropriate. + **/ +int read_pageset1(void) +{ + int error; + + error = __read_pageset1(); + + if (error && error != -ENODATA && error != -EINVAL && + !test_result_state(TOI_ABORTED)) + abort_hibernate(TOI_IMAGE_ERROR, + "TuxOnIce: Error %d resuming\n", error); + + return error; +} + +/** + * get_have_image_data - check the image header + **/ +static char *get_have_image_data(void) +{ + char *output_buffer = (char *) toi_get_zeroed_page(26, TOI_ATOMIC_GFP); + struct toi_header *toi_header; + + if (!output_buffer) { + printk(KERN_INFO "Output buffer null.\n"); + return NULL; + } + + /* Check for an image */ + if (!toiActiveAllocator->image_exists(1) || + toiActiveAllocator->read_header_init() || + toiActiveAllocator->rw_header_chunk(READ, NULL, + output_buffer, sizeof(struct toi_header))) { + sprintf(output_buffer, "0\n"); + /* + * From an initrd/ramfs, catting have_image and + * getting a result of 0 is sufficient. + */ + clear_toi_state(TOI_BOOT_TIME); + goto out; + } + + toi_header = (struct toi_header *) output_buffer; + + sprintf(output_buffer, "1\n%s\n%s\n", + toi_header->uts.machine, + toi_header->uts.version); + + /* Check whether we've resumed before */ + if (test_toi_state(TOI_RESUMED_BEFORE)) + strcat(output_buffer, "Resumed before.\n"); + +out: + noresume_reset_modules(); + return output_buffer; +} + +/** + * read_pageset2 - read second part of the image + * @overwrittenpagesonly: Read only pages which would have been + * verwritten by pageset1? + * + * Read in part or all of pageset2 of an image, depending upon + * whether we are hibernating and have only overwritten a portion + * with pageset1 pages, or are resuming and need to read them + * all. + * + * Returns: Int + * Zero if no error, otherwise the error value. + **/ +int read_pageset2(int overwrittenpagesonly) +{ + int result = 0; + + if (!pagedir2.size) + return 0; + + result = read_pageset(&pagedir2, overwrittenpagesonly); + + toi_cond_pause(1, "Pagedir 2 read."); + + return result; +} + +/** + * image_exists_read - has an image been found? + * @page: Output buffer + * + * Store 0 or 1 in page, depending on whether an image is found. + * Incoming buffer is PAGE_SIZE and result is guaranteed + * to be far less than that, so we don't worry about + * overflow. + **/ +int image_exists_read(const char *page, int count) +{ + int len = 0; + char *result; + + if (toi_activate_storage(0)) + return count; + + if (!test_toi_state(TOI_RESUME_DEVICE_OK)) + toi_attempt_to_parse_resume_device(0); + + if (!toiActiveAllocator) { + len = sprintf((char *) page, "-1\n"); + } else { + result = get_have_image_data(); + if (result) { + len = sprintf((char *) page, "%s", result); + toi_free_page(26, (unsigned long) result); + } + } + + toi_deactivate_storage(0); + + return len; +} + +/** + * image_exists_write - invalidate an image if one exists + **/ +int image_exists_write(const char *buffer, int count) +{ + if (toi_activate_storage(0)) + return count; + + if (toiActiveAllocator && toiActiveAllocator->image_exists(1)) + toiActiveAllocator->remove_image(); + + toi_deactivate_storage(0); + + clear_result_state(TOI_KEPT_IMAGE); + + return count; +} diff --git b/kernel/power/tuxonice_io.h b/kernel/power/tuxonice_io.h new file mode 100644 index 0000000..683eab7 --- /dev/null +++ b/kernel/power/tuxonice_io.h @@ -0,0 +1,72 @@ +/* + * kernel/power/tuxonice_io.h + * + * Copyright (C) 2005-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * It contains high level IO routines for hibernating. + * + */ + +#include +#include "tuxonice_pagedir.h" + +/* Non-module data saved in our image header */ +struct toi_header { + /* + * Mirror struct swsusp_info, but without + * the page aligned attribute + */ + struct new_utsname uts; + u32 version_code; + unsigned long num_physpages; + int cpus; + unsigned long image_pages; + unsigned long pages; + unsigned long size; + + /* Our own data */ + unsigned long orig_mem_free; + int page_size; + int pageset_2_size; + int param0; + int param1; + int param2; + int param3; + int progress0; + int progress1; + int progress2; + int progress3; + int io_time[2][2]; + struct pagedir pagedir; + dev_t root_fs; + unsigned long bkd; /* Boot kernel data locn */ +}; + +extern int write_pageset(struct pagedir *pagedir); +extern int write_image_header(void); +extern int read_pageset1(void); +extern int read_pageset2(int overwrittenpagesonly); + +extern int toi_attempt_to_parse_resume_device(int quiet); +extern void attempt_to_parse_resume_device2(void); +extern void attempt_to_parse_alt_resume_param(void); +int image_exists_read(const char *page, int count); +int image_exists_write(const char *buffer, int count); +extern void save_restore_alt_param(int replace, int quiet); +extern atomic_t toi_io_workers; + +/* Args to save_restore_alt_param */ +#define RESTORE 0 +#define SAVE 1 + +#define NOQUIET 0 +#define QUIET 1 + +extern wait_queue_head_t toi_io_queue_flusher; +extern int toi_bio_queue_flusher_should_finish; + +int fs_info_space_needed(void); + +extern int toi_max_workers; diff --git b/kernel/power/tuxonice_modules.c b/kernel/power/tuxonice_modules.c new file mode 100644 index 0000000..a203c8f --- /dev/null +++ b/kernel/power/tuxonice_modules.c @@ -0,0 +1,520 @@ +/* + * kernel/power/tuxonice_modules.c + * + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + */ + +#include +#include +#include "tuxonice.h" +#include "tuxonice_modules.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_ui.h" + +LIST_HEAD(toi_filters); +LIST_HEAD(toiAllocators); + +LIST_HEAD(toi_modules); + +struct toi_module_ops *toiActiveAllocator; + +static int toi_num_filters; +int toiNumAllocators, toi_num_modules; + +/* + * toi_header_storage_for_modules + * + * Returns the amount of space needed to store configuration + * data needed by the modules prior to copying back the original + * kernel. We can exclude data for pageset2 because it will be + * available anyway once the kernel is copied back. + */ +long toi_header_storage_for_modules(void) +{ + struct toi_module_ops *this_module; + int bytes = 0; + + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled || + (this_module->type == WRITER_MODULE && + toiActiveAllocator != this_module)) + continue; + if (this_module->storage_needed) { + int this = this_module->storage_needed() + + sizeof(struct toi_module_header) + + sizeof(int); + this_module->header_requested = this; + bytes += this; + } + } + + /* One more for the empty terminator */ + return bytes + sizeof(struct toi_module_header); +} + +void print_toi_header_storage_for_modules(void) +{ + struct toi_module_ops *this_module; + int bytes = 0; + + printk(KERN_DEBUG "Header storage:\n"); + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled || + (this_module->type == WRITER_MODULE && + toiActiveAllocator != this_module)) + continue; + if (this_module->storage_needed) { + int this = this_module->storage_needed() + + sizeof(struct toi_module_header) + + sizeof(int); + this_module->header_requested = this; + bytes += this; + printk(KERN_DEBUG "+ %16s : %-4d/%d.\n", + this_module->name, + this_module->header_used, this); + } + } + + printk(KERN_DEBUG "+ empty terminator : %zu.\n", + sizeof(struct toi_module_header)); + printk(KERN_DEBUG " ====\n"); + printk(KERN_DEBUG " %zu\n", + bytes + sizeof(struct toi_module_header)); +} + +/* + * toi_memory_for_modules + * + * Returns the amount of memory requested by modules for + * doing their work during the cycle. + */ + +long toi_memory_for_modules(int print_parts) +{ + long bytes = 0, result; + struct toi_module_ops *this_module; + + if (print_parts) + printk(KERN_INFO "Memory for modules:\n===================\n"); + list_for_each_entry(this_module, &toi_modules, module_list) { + int this; + if (!this_module->enabled) + continue; + if (this_module->memory_needed) { + this = this_module->memory_needed(); + if (print_parts) + printk(KERN_INFO "%10d bytes (%5ld pages) for " + "module '%s'.\n", this, + DIV_ROUND_UP(this, PAGE_SIZE), + this_module->name); + bytes += this; + } + } + + result = DIV_ROUND_UP(bytes, PAGE_SIZE); + if (print_parts) + printk(KERN_INFO " => %ld bytes, %ld pages.\n", bytes, result); + + return result; +} + +/* + * toi_expected_compression_ratio + * + * Returns the compression ratio expected when saving the image. + */ + +int toi_expected_compression_ratio(void) +{ + int ratio = 100; + struct toi_module_ops *this_module; + + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled) + continue; + if (this_module->expected_compression) + ratio = ratio * this_module->expected_compression() + / 100; + } + + return ratio; +} + +/* toi_find_module_given_dir + * Functionality : Return a module (if found), given a pointer + * to its directory name + */ + +static struct toi_module_ops *toi_find_module_given_dir(char *name) +{ + struct toi_module_ops *this_module, *found_module = NULL; + + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!strcmp(name, this_module->directory)) { + found_module = this_module; + break; + } + } + + return found_module; +} + +/* toi_find_module_given_name + * Functionality : Return a module (if found), given a pointer + * to its name + */ + +struct toi_module_ops *toi_find_module_given_name(char *name) +{ + struct toi_module_ops *this_module, *found_module = NULL; + + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!strcmp(name, this_module->name)) { + found_module = this_module; + break; + } + } + + return found_module; +} + +/* + * toi_print_module_debug_info + * Functionality : Get debugging info from modules into a buffer. + */ +int toi_print_module_debug_info(char *buffer, int buffer_size) +{ + struct toi_module_ops *this_module; + int len = 0; + + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled) + continue; + if (this_module->print_debug_info) { + int result; + result = this_module->print_debug_info(buffer + len, + buffer_size - len); + len += result; + } + } + + /* Ensure null terminated */ + buffer[buffer_size] = 0; + + return len; +} + +/* + * toi_register_module + * + * Register a module. + */ +int toi_register_module(struct toi_module_ops *module) +{ + int i; + struct kobject *kobj; + + if (!hibernation_available()) + return -ENODEV; + + module->enabled = 1; + + if (toi_find_module_given_name(module->name)) { + printk(KERN_INFO "TuxOnIce: Trying to load module %s," + " which is already registered.\n", + module->name); + return -EBUSY; + } + + switch (module->type) { + case FILTER_MODULE: + list_add_tail(&module->type_list, &toi_filters); + toi_num_filters++; + break; + case WRITER_MODULE: + list_add_tail(&module->type_list, &toiAllocators); + toiNumAllocators++; + break; + case MISC_MODULE: + case MISC_HIDDEN_MODULE: + case BIO_ALLOCATOR_MODULE: + break; + default: + printk(KERN_ERR "Hmmm. Module '%s' has an invalid type." + " It has been ignored.\n", module->name); + return -EINVAL; + } + list_add_tail(&module->module_list, &toi_modules); + toi_num_modules++; + + if ((!module->directory && !module->shared_directory) || + !module->sysfs_data || !module->num_sysfs_entries) + return 0; + + /* + * Modules may share a directory, but those with shared_dir + * set must be loaded (via symbol dependencies) after parents + * and unloaded beforehand. + */ + if (module->shared_directory) { + struct toi_module_ops *shared = + toi_find_module_given_dir(module->shared_directory); + if (!shared) { + printk(KERN_ERR "TuxOnIce: Module %s wants to share " + "%s's directory but %s isn't loaded.\n", + module->name, module->shared_directory, + module->shared_directory); + toi_unregister_module(module); + return -ENODEV; + } + kobj = shared->dir_kobj; + } else { + if (!strncmp(module->directory, "[ROOT]", 6)) + kobj = tuxonice_kobj; + else + kobj = make_toi_sysdir(module->directory); + } + module->dir_kobj = kobj; + for (i = 0; i < module->num_sysfs_entries; i++) { + int result = toi_register_sysfs_file(kobj, + &module->sysfs_data[i]); + if (result) + return result; + } + return 0; +} + +/* + * toi_unregister_module + * + * Remove a module. + */ +void toi_unregister_module(struct toi_module_ops *module) +{ + int i; + + if (module->dir_kobj) + for (i = 0; i < module->num_sysfs_entries; i++) + toi_unregister_sysfs_file(module->dir_kobj, + &module->sysfs_data[i]); + + if (!module->shared_directory && module->directory && + strncmp(module->directory, "[ROOT]", 6)) + remove_toi_sysdir(module->dir_kobj); + + switch (module->type) { + case FILTER_MODULE: + list_del(&module->type_list); + toi_num_filters--; + break; + case WRITER_MODULE: + list_del(&module->type_list); + toiNumAllocators--; + if (toiActiveAllocator == module) { + toiActiveAllocator = NULL; + clear_toi_state(TOI_CAN_RESUME); + clear_toi_state(TOI_CAN_HIBERNATE); + } + break; + case MISC_MODULE: + case MISC_HIDDEN_MODULE: + case BIO_ALLOCATOR_MODULE: + break; + default: + printk(KERN_ERR "Module '%s' has an invalid type." + " It has been ignored.\n", module->name); + return; + } + list_del(&module->module_list); + toi_num_modules--; +} + +/* + * toi_move_module_tail + * + * Rearrange modules when reloading the config. + */ +void toi_move_module_tail(struct toi_module_ops *module) +{ + switch (module->type) { + case FILTER_MODULE: + if (toi_num_filters > 1) + list_move_tail(&module->type_list, &toi_filters); + break; + case WRITER_MODULE: + if (toiNumAllocators > 1) + list_move_tail(&module->type_list, &toiAllocators); + break; + case MISC_MODULE: + case MISC_HIDDEN_MODULE: + case BIO_ALLOCATOR_MODULE: + break; + default: + printk(KERN_ERR "Module '%s' has an invalid type." + " It has been ignored.\n", module->name); + return; + } + if ((toi_num_filters + toiNumAllocators) > 1) + list_move_tail(&module->module_list, &toi_modules); +} + +/* + * toi_initialise_modules + * + * Get ready to do some work! + */ +int toi_initialise_modules(int starting_cycle, int early) +{ + struct toi_module_ops *this_module; + int result; + + list_for_each_entry(this_module, &toi_modules, module_list) { + this_module->header_requested = 0; + this_module->header_used = 0; + if (!this_module->enabled) + continue; + if (this_module->early != early) + continue; + if (this_module->initialise) { + result = this_module->initialise(starting_cycle); + if (result) { + toi_cleanup_modules(starting_cycle); + return result; + } + this_module->initialised = 1; + } + } + + return 0; +} + +/* + * toi_cleanup_modules + * + * Tell modules the work is done. + */ +void toi_cleanup_modules(int finishing_cycle) +{ + struct toi_module_ops *this_module; + + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled || !this_module->initialised) + continue; + if (this_module->cleanup) + this_module->cleanup(finishing_cycle); + this_module->initialised = 0; + } +} + +/* + * toi_pre_atomic_restore_modules + * + * Get ready to do some work! + */ +void toi_pre_atomic_restore_modules(struct toi_boot_kernel_data *bkd) +{ + struct toi_module_ops *this_module; + + list_for_each_entry(this_module, &toi_modules, module_list) { + if (this_module->enabled && this_module->pre_atomic_restore) + this_module->pre_atomic_restore(bkd); + } +} + +/* + * toi_post_atomic_restore_modules + * + * Get ready to do some work! + */ +void toi_post_atomic_restore_modules(struct toi_boot_kernel_data *bkd) +{ + struct toi_module_ops *this_module; + + list_for_each_entry(this_module, &toi_modules, module_list) { + if (this_module->enabled && this_module->post_atomic_restore) + this_module->post_atomic_restore(bkd); + } +} + +/* + * toi_get_next_filter + * + * Get the next filter in the pipeline. + */ +struct toi_module_ops *toi_get_next_filter(struct toi_module_ops *filter_sought) +{ + struct toi_module_ops *last_filter = NULL, *this_filter = NULL; + + list_for_each_entry(this_filter, &toi_filters, type_list) { + if (!this_filter->enabled) + continue; + if ((last_filter == filter_sought) || (!filter_sought)) + return this_filter; + last_filter = this_filter; + } + + return toiActiveAllocator; +} + +/** + * toi_show_modules: Printk what support is loaded. + */ +void toi_print_modules(void) +{ + struct toi_module_ops *this_module; + int prev = 0; + + printk(KERN_INFO "TuxOnIce " TOI_CORE_VERSION ", with support for"); + + list_for_each_entry(this_module, &toi_modules, module_list) { + if (this_module->type == MISC_HIDDEN_MODULE) + continue; + printk("%s %s%s%s", prev ? "," : "", + this_module->enabled ? "" : "[", + this_module->name, + this_module->enabled ? "" : "]"); + prev = 1; + } + + printk(".\n"); +} + +/* toi_get_modules + * + * Take a reference to modules so they can't go away under us. + */ + +int toi_get_modules(void) +{ + struct toi_module_ops *this_module; + + list_for_each_entry(this_module, &toi_modules, module_list) { + struct toi_module_ops *this_module2; + + if (try_module_get(this_module->module)) + continue; + + /* Failed! Reverse gets and return error */ + list_for_each_entry(this_module2, &toi_modules, + module_list) { + if (this_module == this_module2) + return -EINVAL; + module_put(this_module2->module); + } + } + return 0; +} + +/* toi_put_modules + * + * Release our references to modules we used. + */ + +void toi_put_modules(void) +{ + struct toi_module_ops *this_module; + + list_for_each_entry(this_module, &toi_modules, module_list) + module_put(this_module->module); +} diff --git b/kernel/power/tuxonice_modules.h b/kernel/power/tuxonice_modules.h new file mode 100644 index 0000000..44f10ab --- /dev/null +++ b/kernel/power/tuxonice_modules.h @@ -0,0 +1,212 @@ +/* + * kernel/power/tuxonice_modules.h + * + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * It contains declarations for modules. Modules are additions to + * TuxOnIce that provide facilities such as image compression or + * encryption, backends for storage of the image and user interfaces. + * + */ + +#ifndef TOI_MODULES_H +#define TOI_MODULES_H + +/* This is the maximum size we store in the image header for a module name */ +#define TOI_MAX_MODULE_NAME_LENGTH 30 + +struct toi_boot_kernel_data; + +/* Per-module metadata */ +struct toi_module_header { + char name[TOI_MAX_MODULE_NAME_LENGTH]; + int enabled; + int type; + int index; + int data_length; + unsigned long signature; +}; + +enum { + FILTER_MODULE, + WRITER_MODULE, + BIO_ALLOCATOR_MODULE, + MISC_MODULE, + MISC_HIDDEN_MODULE, +}; + +enum { + TOI_ASYNC, + TOI_SYNC +}; + +enum { + TOI_VIRT, + TOI_PAGE, +}; + +#define TOI_MAP(type, addr) \ + (type == TOI_PAGE ? kmap(addr) : addr) + +#define TOI_UNMAP(type, addr) \ + do { \ + if (type == TOI_PAGE) \ + kunmap(addr); \ + } while(0) + +struct toi_module_ops { + /* Functions common to all modules */ + int type; + char *name; + char *directory; + char *shared_directory; + struct kobject *dir_kobj; + struct module *module; + int enabled, early, initialised; + struct list_head module_list; + + /* List of filters or allocators */ + struct list_head list, type_list; + + /* + * Requirements for memory and storage in + * the image header.. + */ + int (*memory_needed) (void); + int (*storage_needed) (void); + + int header_requested, header_used; + + int (*expected_compression) (void); + + /* + * Debug info + */ + int (*print_debug_info) (char *buffer, int size); + int (*save_config_info) (char *buffer); + void (*load_config_info) (char *buffer, int len); + + /* + * Initialise & cleanup - general routines called + * at the start and end of a cycle. + */ + int (*initialise) (int starting_cycle); + void (*cleanup) (int finishing_cycle); + + void (*pre_atomic_restore) (struct toi_boot_kernel_data *bkd); + void (*post_atomic_restore) (struct toi_boot_kernel_data *bkd); + + /* + * Calls for allocating storage (allocators only). + * + * Header space is requested separately and cannot fail, but the + * reservation is only applied when main storage is allocated. + * The header space reservation is thus always set prior to + * requesting the allocation of storage - and prior to querying + * how much storage is available. + */ + + unsigned long (*storage_available) (void); + void (*reserve_header_space) (unsigned long space_requested); + int (*register_storage) (void); + int (*allocate_storage) (unsigned long space_requested); + unsigned long (*storage_allocated) (void); + void (*free_unused_storage) (void); + + /* + * Routines used in image I/O. + */ + int (*rw_init) (int rw, int stream_number); + int (*rw_cleanup) (int rw); + int (*write_page) (unsigned long index, int buf_type, void *buf, + unsigned int buf_size); + int (*read_page) (unsigned long *index, int buf_type, void *buf, + unsigned int *buf_size); + int (*io_flusher) (int rw); + + /* Reset module if image exists but reading aborted */ + void (*noresume_reset) (void); + + /* Read and write the metadata */ + int (*write_header_init) (void); + int (*write_header_cleanup) (void); + + int (*read_header_init) (void); + int (*read_header_cleanup) (void); + + /* To be called after read_header_init */ + int (*get_header_version) (void); + + int (*rw_header_chunk) (int rw, struct toi_module_ops *owner, + char *buffer_start, int buffer_size); + + int (*rw_header_chunk_noreadahead) (int rw, + struct toi_module_ops *owner, char *buffer_start, + int buffer_size); + + /* Attempt to parse an image location */ + int (*parse_sig_location) (char *buffer, int only_writer, int quiet); + + /* Throttle I/O according to throughput */ + void (*update_throughput_throttle) (int jif_index); + + /* Flush outstanding I/O */ + int (*finish_all_io) (void); + + /* Determine whether image exists that we can restore */ + int (*image_exists) (int quiet); + + /* Mark the image as having tried to resume */ + int (*mark_resume_attempted) (int); + + /* Destroy image if one exists */ + int (*remove_image) (void); + + /* Sysfs Data */ + struct toi_sysfs_data *sysfs_data; + int num_sysfs_entries; + + /* Block I/O allocator */ + struct toi_bio_allocator_ops *bio_allocator_ops; +}; + +extern int toi_num_modules, toiNumAllocators; + +extern struct toi_module_ops *toiActiveAllocator; +extern struct list_head toi_filters, toiAllocators, toi_modules; + +extern void toi_prepare_console_modules(void); +extern void toi_cleanup_console_modules(void); + +extern struct toi_module_ops *toi_find_module_given_name(char *name); +extern struct toi_module_ops *toi_get_next_filter(struct toi_module_ops *); + +extern int toi_register_module(struct toi_module_ops *module); +extern void toi_move_module_tail(struct toi_module_ops *module); + +extern long toi_header_storage_for_modules(void); +extern long toi_memory_for_modules(int print_parts); +extern void print_toi_header_storage_for_modules(void); +extern int toi_expected_compression_ratio(void); + +extern int toi_print_module_debug_info(char *buffer, int buffer_size); +extern int toi_register_module(struct toi_module_ops *module); +extern void toi_unregister_module(struct toi_module_ops *module); + +extern int toi_initialise_modules(int starting_cycle, int early); +#define toi_initialise_modules_early(starting) \ + toi_initialise_modules(starting, 1) +#define toi_initialise_modules_late(starting) \ + toi_initialise_modules(starting, 0) +extern void toi_cleanup_modules(int finishing_cycle); + +extern void toi_post_atomic_restore_modules(struct toi_boot_kernel_data *bkd); +extern void toi_pre_atomic_restore_modules(struct toi_boot_kernel_data *bkd); + +extern void toi_print_modules(void); + +int toi_get_modules(void); +void toi_put_modules(void); +#endif diff --git b/kernel/power/tuxonice_netlink.c b/kernel/power/tuxonice_netlink.c new file mode 100644 index 0000000..78bd31b --- /dev/null +++ b/kernel/power/tuxonice_netlink.c @@ -0,0 +1,324 @@ +/* + * kernel/power/tuxonice_netlink.c + * + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * Functions for communicating with a userspace helper via netlink. + */ + +#include +#include +#include +#include "tuxonice_netlink.h" +#include "tuxonice.h" +#include "tuxonice_modules.h" +#include "tuxonice_alloc.h" +#include "tuxonice_builtin.h" + +static struct user_helper_data *uhd_list; + +/* + * Refill our pool of SKBs for use in emergencies (eg, when eating memory and + * none can be allocated). + */ +static void toi_fill_skb_pool(struct user_helper_data *uhd) +{ + while (uhd->pool_level < uhd->pool_limit) { + struct sk_buff *new_skb = + alloc_skb(NLMSG_SPACE(uhd->skb_size), TOI_ATOMIC_GFP); + + if (!new_skb) + break; + + new_skb->next = uhd->emerg_skbs; + uhd->emerg_skbs = new_skb; + uhd->pool_level++; + } +} + +/* + * Try to allocate a single skb. If we can't get one, try to use one from + * our pool. + */ +static struct sk_buff *toi_get_skb(struct user_helper_data *uhd) +{ + struct sk_buff *skb = + alloc_skb(NLMSG_SPACE(uhd->skb_size), TOI_ATOMIC_GFP); + + if (skb) + return skb; + + skb = uhd->emerg_skbs; + if (skb) { + uhd->pool_level--; + uhd->emerg_skbs = skb->next; + skb->next = NULL; + } + + return skb; +} + +void toi_send_netlink_message(struct user_helper_data *uhd, + int type, void *params, size_t len) +{ + struct sk_buff *skb; + struct nlmsghdr *nlh; + void *dest; + struct task_struct *t; + + if (uhd->pid == -1) + return; + + if (uhd->debug) + printk(KERN_ERR "toi_send_netlink_message: Send " + "message type %d.\n", type); + + skb = toi_get_skb(uhd); + if (!skb) { + printk(KERN_INFO "toi_netlink: Can't allocate skb!\n"); + return; + } + + nlh = nlmsg_put(skb, 0, uhd->sock_seq, type, len, 0); + uhd->sock_seq++; + + dest = NLMSG_DATA(nlh); + if (params && len > 0) + memcpy(dest, params, len); + + netlink_unicast(uhd->nl, skb, uhd->pid, 0); + + toi_read_lock_tasklist(); + t = find_task_by_pid_ns(uhd->pid, &init_pid_ns); + if (!t) { + toi_read_unlock_tasklist(); + if (uhd->pid > -1) + printk(KERN_INFO "Hmm. Can't find the userspace task" + " %d.\n", uhd->pid); + return; + } + wake_up_process(t); + toi_read_unlock_tasklist(); + + yield(); +} + +static void send_whether_debugging(struct user_helper_data *uhd) +{ + static u8 is_debugging = 1; + + toi_send_netlink_message(uhd, NETLINK_MSG_IS_DEBUGGING, + &is_debugging, sizeof(u8)); +} + +/* + * Set the PF_NOFREEZE flag on the given process to ensure it can run whilst we + * are hibernating. + */ +static int nl_set_nofreeze(struct user_helper_data *uhd, __u32 pid) +{ + struct task_struct *t; + + if (uhd->debug) + printk(KERN_ERR "nl_set_nofreeze for pid %d.\n", pid); + + toi_read_lock_tasklist(); + t = find_task_by_pid_ns(pid, &init_pid_ns); + if (!t) { + toi_read_unlock_tasklist(); + printk(KERN_INFO "Strange. Can't find the userspace task %d.\n", + pid); + return -EINVAL; + } + + t->flags |= PF_NOFREEZE; + + toi_read_unlock_tasklist(); + uhd->pid = pid; + + toi_send_netlink_message(uhd, NETLINK_MSG_NOFREEZE_ACK, NULL, 0); + + return 0; +} + +/* + * Called when the userspace process has informed us that it's ready to roll. + */ +static int nl_ready(struct user_helper_data *uhd, u32 version) +{ + if (version != uhd->interface_version) { + printk(KERN_INFO "%s userspace process using invalid interface" + " version (%d - kernel wants %d). Trying to " + "continue without it.\n", + uhd->name, version, uhd->interface_version); + if (uhd->not_ready) + uhd->not_ready(); + return -EINVAL; + } + + complete(&uhd->wait_for_process); + + return 0; +} + +void toi_netlink_close_complete(struct user_helper_data *uhd) +{ + if (uhd->nl) { + netlink_kernel_release(uhd->nl); + uhd->nl = NULL; + } + + while (uhd->emerg_skbs) { + struct sk_buff *next = uhd->emerg_skbs->next; + kfree_skb(uhd->emerg_skbs); + uhd->emerg_skbs = next; + } + + uhd->pid = -1; +} + +static int toi_nl_gen_rcv_msg(struct user_helper_data *uhd, + struct sk_buff *skb, struct nlmsghdr *nlh) +{ + int type = nlh->nlmsg_type; + int *data; + int err; + + if (uhd->debug) + printk(KERN_ERR "toi_user_rcv_skb: Received message %d.\n", + type); + + /* Let the more specific handler go first. It returns + * 1 for valid messages that it doesn't know. */ + err = uhd->rcv_msg(skb, nlh); + if (err != 1) + return err; + + /* Only allow one task to receive NOFREEZE privileges */ + if (type == NETLINK_MSG_NOFREEZE_ME && uhd->pid != -1) { + printk(KERN_INFO "Received extra nofreeze me requests.\n"); + return -EBUSY; + } + + data = NLMSG_DATA(nlh); + + switch (type) { + case NETLINK_MSG_NOFREEZE_ME: + return nl_set_nofreeze(uhd, nlh->nlmsg_pid); + case NETLINK_MSG_GET_DEBUGGING: + send_whether_debugging(uhd); + return 0; + case NETLINK_MSG_READY: + if (nlh->nlmsg_len != NLMSG_LENGTH(sizeof(u32))) { + printk(KERN_INFO "Invalid ready mesage.\n"); + if (uhd->not_ready) + uhd->not_ready(); + return -EINVAL; + } + return nl_ready(uhd, (u32) *data); + case NETLINK_MSG_CLEANUP: + toi_netlink_close_complete(uhd); + return 0; + } + + return -EINVAL; +} + +static void toi_user_rcv_skb(struct sk_buff *skb) +{ + int err; + struct nlmsghdr *nlh; + struct user_helper_data *uhd = uhd_list; + + while (uhd && uhd->netlink_id != skb->sk->sk_protocol) + uhd = uhd->next; + + if (!uhd) + return; + + while (skb->len >= NLMSG_SPACE(0)) { + u32 rlen; + + nlh = (struct nlmsghdr *) skb->data; + if (nlh->nlmsg_len < sizeof(*nlh) || skb->len < nlh->nlmsg_len) + return; + + rlen = NLMSG_ALIGN(nlh->nlmsg_len); + if (rlen > skb->len) + rlen = skb->len; + + err = toi_nl_gen_rcv_msg(uhd, skb, nlh); + if (err) + netlink_ack(skb, nlh, err); + else if (nlh->nlmsg_flags & NLM_F_ACK) + netlink_ack(skb, nlh, 0); + skb_pull(skb, rlen); + } +} + +static int netlink_prepare(struct user_helper_data *uhd) +{ + struct netlink_kernel_cfg cfg = { + .groups = 0, + .input = toi_user_rcv_skb, + }; + + uhd->next = uhd_list; + uhd_list = uhd; + + uhd->sock_seq = 0x42c0ffee; + uhd->nl = netlink_kernel_create(&init_net, uhd->netlink_id, &cfg); + if (!uhd->nl) { + printk(KERN_INFO "Failed to allocate netlink socket for %s.\n", + uhd->name); + return -ENOMEM; + } + + toi_fill_skb_pool(uhd); + + return 0; +} + +void toi_netlink_close(struct user_helper_data *uhd) +{ + struct task_struct *t; + + toi_read_lock_tasklist(); + t = find_task_by_pid_ns(uhd->pid, &init_pid_ns); + if (t) + t->flags &= ~PF_NOFREEZE; + toi_read_unlock_tasklist(); + + toi_send_netlink_message(uhd, NETLINK_MSG_CLEANUP, NULL, 0); +} +int toi_netlink_setup(struct user_helper_data *uhd) +{ + /* In case userui didn't cleanup properly on us */ + toi_netlink_close_complete(uhd); + + if (netlink_prepare(uhd) < 0) { + printk(KERN_INFO "Netlink prepare failed.\n"); + return 1; + } + + if (toi_launch_userspace_program(uhd->program, uhd->netlink_id, + UMH_WAIT_EXEC, uhd->debug) < 0) { + printk(KERN_INFO "Launch userspace program failed.\n"); + toi_netlink_close_complete(uhd); + return 1; + } + + /* Wait 2 seconds for the userspace process to make contact */ + wait_for_completion_timeout(&uhd->wait_for_process, 2*HZ); + + if (uhd->pid == -1) { + printk(KERN_INFO "%s: Failed to contact userspace process.\n", + uhd->name); + toi_netlink_close_complete(uhd); + return 1; + } + + return 0; +} diff --git b/kernel/power/tuxonice_netlink.h b/kernel/power/tuxonice_netlink.h new file mode 100644 index 0000000..6613c8e --- /dev/null +++ b/kernel/power/tuxonice_netlink.h @@ -0,0 +1,62 @@ +/* + * kernel/power/tuxonice_netlink.h + * + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * Declarations for functions for communicating with a userspace helper + * via netlink. + */ + +#include +#include + +#define NETLINK_MSG_BASE 0x10 + +#define NETLINK_MSG_READY 0x10 +#define NETLINK_MSG_NOFREEZE_ME 0x16 +#define NETLINK_MSG_GET_DEBUGGING 0x19 +#define NETLINK_MSG_CLEANUP 0x24 +#define NETLINK_MSG_NOFREEZE_ACK 0x27 +#define NETLINK_MSG_IS_DEBUGGING 0x28 + +struct user_helper_data { + int (*rcv_msg) (struct sk_buff *skb, struct nlmsghdr *nlh); + void (*not_ready) (void); + struct sock *nl; + u32 sock_seq; + pid_t pid; + char *comm; + char program[256]; + int pool_level; + int pool_limit; + struct sk_buff *emerg_skbs; + int skb_size; + int netlink_id; + char *name; + struct user_helper_data *next; + struct completion wait_for_process; + u32 interface_version; + int must_init; + int debug; +}; + +#ifdef CONFIG_NET +int toi_netlink_setup(struct user_helper_data *uhd); +void toi_netlink_close(struct user_helper_data *uhd); +void toi_send_netlink_message(struct user_helper_data *uhd, + int type, void *params, size_t len); +void toi_netlink_close_complete(struct user_helper_data *uhd); +#else +static inline int toi_netlink_setup(struct user_helper_data *uhd) +{ + return 0; +} + +static inline void toi_netlink_close(struct user_helper_data *uhd) { }; +static inline void toi_send_netlink_message(struct user_helper_data *uhd, + int type, void *params, size_t len) { }; +static inline void toi_netlink_close_complete(struct user_helper_data *uhd) + { }; +#endif diff --git b/kernel/power/tuxonice_pagedir.c b/kernel/power/tuxonice_pagedir.c new file mode 100644 index 0000000..d469f3d --- /dev/null +++ b/kernel/power/tuxonice_pagedir.c @@ -0,0 +1,345 @@ +/* + * kernel/power/tuxonice_pagedir.c + * + * Copyright (C) 1998-2001 Gabor Kuti + * Copyright (C) 1998,2001,2002 Pavel Machek + * Copyright (C) 2002-2003 Florent Chabaud + * Copyright (C) 2006-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * Routines for handling pagesets. + * Note that pbes aren't actually stored as such. They're stored as + * bitmaps and extents. + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "tuxonice_pageflags.h" +#include "tuxonice_ui.h" +#include "tuxonice_pagedir.h" +#include "tuxonice_prepare_image.h" +#include "tuxonice.h" +#include "tuxonice_builtin.h" +#include "tuxonice_alloc.h" + +static int ptoi_pfn; +static struct pbe *this_low_pbe; +static struct pbe **last_low_pbe_ptr; + +void toi_reset_alt_image_pageset2_pfn(void) +{ + memory_bm_position_reset(pageset2_map); +} + +static struct page *first_conflicting_page; + +/* + * free_conflicting_pages + */ + +static void free_conflicting_pages(void) +{ + while (first_conflicting_page) { + struct page *next = + *((struct page **) kmap(first_conflicting_page)); + kunmap(first_conflicting_page); + toi__free_page(29, first_conflicting_page); + first_conflicting_page = next; + } +} + +/* __toi_get_nonconflicting_page + * + * Description: Gets order zero pages that won't be overwritten + * while copying the original pages. + */ + +struct page *___toi_get_nonconflicting_page(int can_be_highmem) +{ + struct page *page; + gfp_t flags = TOI_ATOMIC_GFP; + if (can_be_highmem) + flags |= __GFP_HIGHMEM; + + + if (test_toi_state(TOI_LOADING_ALT_IMAGE) && + pageset2_map && ptoi_pfn) { + do { + ptoi_pfn = memory_bm_next_pfn(pageset2_map, 0); + if (ptoi_pfn != BM_END_OF_MAP) { + page = pfn_to_page(ptoi_pfn); + if (!PagePageset1(page) && + (can_be_highmem || !PageHighMem(page))) + return page; + } + } while (ptoi_pfn); + } + + do { + page = toi_alloc_page(29, flags | __GFP_ZERO); + if (!page) { + printk(KERN_INFO "Failed to get nonconflicting " + "page.\n"); + return NULL; + } + if (PagePageset1(page)) { + struct page **next = (struct page **) kmap(page); + *next = first_conflicting_page; + first_conflicting_page = page; + kunmap(page); + } + } while (PagePageset1(page)); + + return page; +} + +unsigned long __toi_get_nonconflicting_page(void) +{ + struct page *page = ___toi_get_nonconflicting_page(0); + return page ? (unsigned long) page_address(page) : 0; +} + +static struct pbe *get_next_pbe(struct page **page_ptr, struct pbe *this_pbe, + int highmem) +{ + if (((((unsigned long) this_pbe) & (PAGE_SIZE - 1)) + + 2 * sizeof(struct pbe)) > PAGE_SIZE) { + struct page *new_page = + ___toi_get_nonconflicting_page(highmem); + if (!new_page) + return ERR_PTR(-ENOMEM); + this_pbe = (struct pbe *) kmap(new_page); + memset(this_pbe, 0, PAGE_SIZE); + *page_ptr = new_page; + } else + this_pbe++; + + return this_pbe; +} + +/** + * get_pageset1_load_addresses - generate pbes for conflicting pages + * + * We check here that pagedir & pages it points to won't collide + * with pages where we're going to restore from the loaded pages + * later. + * + * Returns: + * Zero on success, one if couldn't find enough pages (shouldn't + * happen). + **/ +int toi_get_pageset1_load_addresses(void) +{ + int pfn, highallocd = 0, lowallocd = 0; + int low_needed = pagedir1.size - get_highmem_size(pagedir1); + int high_needed = get_highmem_size(pagedir1); + int low_pages_for_highmem = 0; + gfp_t flags = GFP_ATOMIC | __GFP_NOWARN | __GFP_HIGHMEM; + struct page *page, *high_pbe_page = NULL, *last_high_pbe_page = NULL, + *low_pbe_page, *last_low_pbe_page = NULL; + struct pbe **last_high_pbe_ptr = &restore_highmem_pblist, + *this_high_pbe = NULL; + unsigned long orig_low_pfn, orig_high_pfn; + int high_pbes_done = 0, low_pbes_done = 0; + int low_direct = 0, high_direct = 0, result = 0, i; + int high_page = 1, high_offset = 0, low_page = 1, low_offset = 0; + + toi_trace_index++; + + memory_bm_position_reset(pageset1_map); + memory_bm_position_reset(pageset1_copy_map); + + last_low_pbe_ptr = &restore_pblist; + + /* First, allocate pages for the start of our pbe lists. */ + if (high_needed) { + high_pbe_page = ___toi_get_nonconflicting_page(1); + if (!high_pbe_page) { + result = -ENOMEM; + goto out; + } + this_high_pbe = (struct pbe *) kmap(high_pbe_page); + memset(this_high_pbe, 0, PAGE_SIZE); + } + + low_pbe_page = ___toi_get_nonconflicting_page(0); + if (!low_pbe_page) { + result = -ENOMEM; + goto out; + } + this_low_pbe = (struct pbe *) page_address(low_pbe_page); + + /* + * Next, allocate the number of pages we need. + */ + + i = low_needed + high_needed; + + do { + int is_high; + + if (i == low_needed) + flags &= ~__GFP_HIGHMEM; + + page = toi_alloc_page(30, flags); + BUG_ON(!page); + + SetPagePageset1Copy(page); + is_high = PageHighMem(page); + + if (PagePageset1(page)) { + if (is_high) + high_direct++; + else + low_direct++; + } else { + if (is_high) + highallocd++; + else + lowallocd++; + } + } while (--i); + + high_needed -= high_direct; + low_needed -= low_direct; + + /* + * Do we need to use some lowmem pages for the copies of highmem + * pages? + */ + if (high_needed > highallocd) { + low_pages_for_highmem = high_needed - highallocd; + high_needed -= low_pages_for_highmem; + low_needed += low_pages_for_highmem; + } + + /* + * Now generate our pbes (which will be used for the atomic restore), + * and free unneeded pages. + */ + memory_bm_position_reset(pageset1_copy_map); + for (pfn = memory_bm_next_pfn(pageset1_copy_map, 0); pfn != BM_END_OF_MAP; + pfn = memory_bm_next_pfn(pageset1_copy_map, 0)) { + int is_high; + page = pfn_to_page(pfn); + is_high = PageHighMem(page); + + if (PagePageset1(page)) + continue; + + /* Nope. We're going to use this page. Add a pbe. */ + if (is_high || low_pages_for_highmem) { + struct page *orig_page; + high_pbes_done++; + if (!is_high) + low_pages_for_highmem--; + do { + orig_high_pfn = memory_bm_next_pfn(pageset1_map, 0); + BUG_ON(orig_high_pfn == BM_END_OF_MAP); + orig_page = pfn_to_page(orig_high_pfn); + } while (!PageHighMem(orig_page) || + PagePageset1Copy(orig_page)); + + this_high_pbe->orig_address = (void *) orig_high_pfn; + this_high_pbe->address = page; + this_high_pbe->next = NULL; + toi_message(TOI_PAGEDIR, TOI_VERBOSE, 0, "High pbe %d/%d: %p(%d)=>%p", + high_page, high_offset, page, orig_high_pfn, orig_page); + if (last_high_pbe_page != high_pbe_page) { + *last_high_pbe_ptr = + (struct pbe *) high_pbe_page; + if (last_high_pbe_page) { + kunmap(last_high_pbe_page); + high_page++; + high_offset = 0; + } else + high_offset++; + last_high_pbe_page = high_pbe_page; + } else { + *last_high_pbe_ptr = this_high_pbe; + high_offset++; + } + last_high_pbe_ptr = &this_high_pbe->next; + this_high_pbe = get_next_pbe(&high_pbe_page, + this_high_pbe, 1); + if (IS_ERR(this_high_pbe)) { + printk(KERN_INFO + "This high pbe is an error.\n"); + return -ENOMEM; + } + } else { + struct page *orig_page; + low_pbes_done++; + do { + orig_low_pfn = memory_bm_next_pfn(pageset1_map, 0); + BUG_ON(orig_low_pfn == BM_END_OF_MAP); + orig_page = pfn_to_page(orig_low_pfn); + } while (PageHighMem(orig_page) || + PagePageset1Copy(orig_page)); + + this_low_pbe->orig_address = page_address(orig_page); + this_low_pbe->address = page_address(page); + this_low_pbe->next = NULL; + toi_message(TOI_PAGEDIR, TOI_VERBOSE, 0, "Low pbe %d/%d: %p(%d)=>%p", + low_page, low_offset, this_low_pbe->orig_address, + orig_low_pfn, this_low_pbe->address); + TOI_TRACE_DEBUG(orig_low_pfn, "LoadAddresses (%d/%d): %p=>%p", low_page, low_offset, this_low_pbe->orig_address, this_low_pbe->address); + *last_low_pbe_ptr = this_low_pbe; + last_low_pbe_ptr = &this_low_pbe->next; + this_low_pbe = get_next_pbe(&low_pbe_page, + this_low_pbe, 0); + if (low_pbe_page != last_low_pbe_page) { + if (last_low_pbe_page) { + low_page++; + low_offset = 0; + } else { + low_offset++; + } + last_low_pbe_page = low_pbe_page; + } else + low_offset++; + if (IS_ERR(this_low_pbe)) { + printk(KERN_INFO "this_low_pbe is an error.\n"); + return -ENOMEM; + } + } + } + + if (high_pbe_page) + kunmap(high_pbe_page); + + if (last_high_pbe_page != high_pbe_page) { + if (last_high_pbe_page) + kunmap(last_high_pbe_page); + toi__free_page(29, high_pbe_page); + } + + free_conflicting_pages(); + +out: + return result; +} + +int add_boot_kernel_data_pbe(void) +{ + this_low_pbe->address = (char *) __toi_get_nonconflicting_page(); + if (!this_low_pbe->address) { + printk(KERN_INFO "Failed to get bkd atomic restore buffer."); + return -ENOMEM; + } + + toi_bkd.size = sizeof(toi_bkd); + memcpy(this_low_pbe->address, &toi_bkd, sizeof(toi_bkd)); + + *last_low_pbe_ptr = this_low_pbe; + this_low_pbe->orig_address = (char *) boot_kernel_data_buffer; + this_low_pbe->next = NULL; + return 0; +} diff --git b/kernel/power/tuxonice_pagedir.h b/kernel/power/tuxonice_pagedir.h new file mode 100644 index 0000000..0465359 --- /dev/null +++ b/kernel/power/tuxonice_pagedir.h @@ -0,0 +1,50 @@ +/* + * kernel/power/tuxonice_pagedir.h + * + * Copyright (C) 2006-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * Declarations for routines for handling pagesets. + */ + +#ifndef KERNEL_POWER_PAGEDIR_H +#define KERNEL_POWER_PAGEDIR_H + +/* Pagedir + * + * Contains the metadata for a set of pages saved in the image. + */ + +struct pagedir { + int id; + unsigned long size; +#ifdef CONFIG_HIGHMEM + unsigned long size_high; +#endif +}; + +#ifdef CONFIG_HIGHMEM +#define get_highmem_size(pagedir) (pagedir.size_high) +#define set_highmem_size(pagedir, sz) do { pagedir.size_high = sz; } while (0) +#define inc_highmem_size(pagedir) do { pagedir.size_high++; } while (0) +#define get_lowmem_size(pagedir) (pagedir.size - pagedir.size_high) +#else +#define get_highmem_size(pagedir) (0) +#define set_highmem_size(pagedir, sz) do { } while (0) +#define inc_highmem_size(pagedir) do { } while (0) +#define get_lowmem_size(pagedir) (pagedir.size) +#endif + +extern struct pagedir pagedir1, pagedir2; + +extern void toi_copy_pageset1(void); + +extern int toi_get_pageset1_load_addresses(void); + +extern unsigned long __toi_get_nonconflicting_page(void); +struct page *___toi_get_nonconflicting_page(int can_be_highmem); + +extern void toi_reset_alt_image_pageset2_pfn(void); +extern int add_boot_kernel_data_pbe(void); +#endif diff --git b/kernel/power/tuxonice_pageflags.c b/kernel/power/tuxonice_pageflags.c new file mode 100644 index 0000000..0fe92ed --- /dev/null +++ b/kernel/power/tuxonice_pageflags.c @@ -0,0 +1,18 @@ +/* + * kernel/power/tuxonice_pageflags.c + * + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * Routines for serialising and relocating pageflags in which we + * store our image metadata. + */ + +#include "tuxonice_pageflags.h" +#include "power.h" + +int toi_pageflags_space_needed(void) +{ + return memory_bm_space_needed(pageset1_map); +} diff --git b/kernel/power/tuxonice_pageflags.h b/kernel/power/tuxonice_pageflags.h new file mode 100644 index 0000000..ddeeaf1 --- /dev/null +++ b/kernel/power/tuxonice_pageflags.h @@ -0,0 +1,106 @@ +/* + * kernel/power/tuxonice_pageflags.h + * + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + */ + +#ifndef KERNEL_POWER_TUXONICE_PAGEFLAGS_H +#define KERNEL_POWER_TUXONICE_PAGEFLAGS_H + +struct memory_bitmap; +void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free); +void memory_bm_clear(struct memory_bitmap *bm); + +int mem_bm_set_bit_check(struct memory_bitmap *bm, int index, unsigned long pfn); +void memory_bm_set_bit(struct memory_bitmap *bm, int index, unsigned long pfn); +unsigned long memory_bm_next_pfn(struct memory_bitmap *bm, int index); +unsigned long memory_bm_next_pfn_index(struct memory_bitmap *bm, int index); +void memory_bm_position_reset(struct memory_bitmap *bm); +void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free); +int toi_alloc_bitmap(struct memory_bitmap **bm); +void toi_free_bitmap(struct memory_bitmap **bm); +void memory_bm_clear(struct memory_bitmap *bm); +void memory_bm_clear_bit(struct memory_bitmap *bm, int index, unsigned long pfn); +void memory_bm_set_bit(struct memory_bitmap *bm, int index, unsigned long pfn); +int memory_bm_test_bit(struct memory_bitmap *bm, int index, unsigned long pfn); +int memory_bm_test_bit_index(struct memory_bitmap *bm, int index, unsigned long pfn); +void memory_bm_clear_bit_index(struct memory_bitmap *bm, int index, unsigned long pfn); + +struct toi_module_ops; +int memory_bm_write(struct memory_bitmap *bm, int (*rw_chunk) + (int rw, struct toi_module_ops *owner, char *buffer, int buffer_size)); +int memory_bm_read(struct memory_bitmap *bm, int (*rw_chunk) + (int rw, struct toi_module_ops *owner, char *buffer, int buffer_size)); +int memory_bm_space_needed(struct memory_bitmap *bm); + +extern struct memory_bitmap *pageset1_map; +extern struct memory_bitmap *pageset1_copy_map; +extern struct memory_bitmap *pageset2_map; +extern struct memory_bitmap *page_resave_map; +extern struct memory_bitmap *io_map; +extern struct memory_bitmap *nosave_map; +extern struct memory_bitmap *free_map; +extern struct memory_bitmap *compare_map; + +#define PagePageset1(page) \ + (pageset1_map && memory_bm_test_bit(pageset1_map, smp_processor_id(), page_to_pfn(page))) +#define SetPagePageset1(page) \ + (memory_bm_set_bit(pageset1_map, smp_processor_id(), page_to_pfn(page))) +#define ClearPagePageset1(page) \ + (memory_bm_clear_bit(pageset1_map, smp_processor_id(), page_to_pfn(page))) + +#define PagePageset1Copy(page) \ + (memory_bm_test_bit(pageset1_copy_map, smp_processor_id(), page_to_pfn(page))) +#define SetPagePageset1Copy(page) \ + (memory_bm_set_bit(pageset1_copy_map, smp_processor_id(), page_to_pfn(page))) +#define ClearPagePageset1Copy(page) \ + (memory_bm_clear_bit(pageset1_copy_map, smp_processor_id(), page_to_pfn(page))) + +#define PagePageset2(page) \ + (memory_bm_test_bit(pageset2_map, smp_processor_id(), page_to_pfn(page))) +#define SetPagePageset2(page) \ + (memory_bm_set_bit(pageset2_map, smp_processor_id(), page_to_pfn(page))) +#define ClearPagePageset2(page) \ + (memory_bm_clear_bit(pageset2_map, smp_processor_id(), page_to_pfn(page))) + +#define PageWasRW(page) \ + (memory_bm_test_bit(pageset2_map, smp_processor_id(), page_to_pfn(page))) +#define SetPageWasRW(page) \ + (memory_bm_set_bit(pageset2_map, smp_processor_id(), page_to_pfn(page))) +#define ClearPageWasRW(page) \ + (memory_bm_clear_bit(pageset2_map, smp_processor_id(), page_to_pfn(page))) + +#define PageResave(page) (page_resave_map ? \ + memory_bm_test_bit(page_resave_map, smp_processor_id(), page_to_pfn(page)) : 0) +#define SetPageResave(page) \ + (memory_bm_set_bit(page_resave_map, smp_processor_id(), page_to_pfn(page))) +#define ClearPageResave(page) \ + (memory_bm_clear_bit(page_resave_map, smp_processor_id(), page_to_pfn(page))) + +#define PageNosave(page) (nosave_map ? \ + memory_bm_test_bit(nosave_map, smp_processor_id(), page_to_pfn(page)) : 0) +#define SetPageNosave(page) \ + (mem_bm_set_bit_check(nosave_map, smp_processor_id(), page_to_pfn(page))) +#define ClearPageNosave(page) \ + (memory_bm_clear_bit(nosave_map, smp_processor_id(), page_to_pfn(page))) + +#define PageNosaveFree(page) (free_map ? \ + memory_bm_test_bit(free_map, smp_processor_id(), page_to_pfn(page)) : 0) +#define SetPageNosaveFree(page) \ + (memory_bm_set_bit(free_map, smp_processor_id(), page_to_pfn(page))) +#define ClearPageNosaveFree(page) \ + (memory_bm_clear_bit(free_map, smp_processor_id(), page_to_pfn(page))) + +#define PageCompareChanged(page) (compare_map ? \ + memory_bm_test_bit(compare_map, smp_processor_id(), page_to_pfn(page)) : 0) +#define SetPageCompareChanged(page) \ + (memory_bm_set_bit(compare_map, smp_processor_id(), page_to_pfn(page))) +#define ClearPageCompareChanged(page) \ + (memory_bm_clear_bit(compare_map, smp_processor_id(), page_to_pfn(page))) + +extern void save_pageflags(struct memory_bitmap *pagemap); +extern int load_pageflags(struct memory_bitmap *pagemap); +extern int toi_pageflags_space_needed(void); +#endif diff --git b/kernel/power/tuxonice_power_off.c b/kernel/power/tuxonice_power_off.c new file mode 100644 index 0000000..7c78773 --- /dev/null +++ b/kernel/power/tuxonice_power_off.c @@ -0,0 +1,286 @@ +/* + * kernel/power/tuxonice_power_off.c + * + * Copyright (C) 2006-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * Support for powering down. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include "tuxonice.h" +#include "tuxonice_ui.h" +#include "tuxonice_power_off.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_modules.h" +#include "tuxonice_io.h" + +unsigned long toi_poweroff_method; /* 0 - Kernel power off */ + +static int wake_delay; +static char lid_state_file[256], wake_alarm_dir[256]; +static struct file *lid_file, *alarm_file, *epoch_file; +static int post_wake_state = -1; + +static int did_suspend_to_both; + +/* + * __toi_power_down + * Functionality : Powers down or reboots the computer once the image + * has been written to disk. + * Key Assumptions : Able to reboot/power down via code called or that + * the warning emitted if the calls fail will be visible + * to the user (ie printk resumes devices). + */ + +static void __toi_power_down(int method) +{ + int error; + + toi_cond_pause(1, test_action_state(TOI_REBOOT) ? "Ready to reboot." : + "Powering down."); + + if (test_result_state(TOI_ABORTED)) + goto out; + + if (test_action_state(TOI_REBOOT)) + kernel_restart(NULL); + + switch (method) { + case 0: + break; + case 3: + /* + * Re-read the overwritten part of pageset2 to make post-resume + * faster. + */ + if (read_pageset2(1)) + panic("Attempt to reload pagedir 2 failed. " + "Try rebooting."); + + pm_prepare_console(); + + error = pm_notifier_call_chain(PM_SUSPEND_PREPARE); + if (!error) { + pm_restore_gfp_mask(); + error = suspend_devices_and_enter(PM_SUSPEND_MEM); + pm_restrict_gfp_mask(); + if (!error) + did_suspend_to_both = 1; + } + pm_notifier_call_chain(PM_POST_SUSPEND); + pm_restore_console(); + + /* Success - we're now post-resume-from-ram */ + if (did_suspend_to_both) + return; + + /* Failed to suspend to ram - do normal power off */ + break; + case 4: + /* + * If succeeds, doesn't return. If fails, do a simple + * powerdown. + */ + hibernation_platform_enter(); + break; + case 5: + /* Historic entry only now */ + break; + } + + if (method && method != 5) + toi_cond_pause(1, + "Falling back to alternate power off method."); + + if (test_result_state(TOI_ABORTED)) + goto out; + + if (pm_power_off) + kernel_power_off(); + kernel_halt(); + toi_cond_pause(1, "Powerdown failed."); + while (1) + cpu_relax(); + +out: + if (read_pageset2(1)) + panic("Attempt to reload pagedir 2 failed. Try rebooting."); + return; +} + +#define CLOSE_FILE(file) \ + if (file) { \ + filp_close(file, NULL); file = NULL; \ + } + +static void powerdown_cleanup(int toi_or_resume) +{ + if (!toi_or_resume) + return; + + CLOSE_FILE(lid_file); + CLOSE_FILE(alarm_file); + CLOSE_FILE(epoch_file); +} + +static void open_file(char *format, char *arg, struct file **var, int mode, + char *desc) +{ + char buf[256]; + + if (strlen(arg)) { + sprintf(buf, format, arg); + *var = filp_open(buf, mode, 0); + if (IS_ERR(*var) || !*var) { + printk(KERN_INFO "Failed to open %s file '%s' (%p).\n", + desc, buf, *var); + *var = NULL; + } + } +} + +static int powerdown_init(int toi_or_resume) +{ + if (!toi_or_resume) + return 0; + + did_suspend_to_both = 0; + + open_file("/proc/acpi/button/%s/state", lid_state_file, &lid_file, + O_RDONLY, "lid"); + + if (strlen(wake_alarm_dir)) { + open_file("/sys/class/rtc/%s/wakealarm", wake_alarm_dir, + &alarm_file, O_WRONLY, "alarm"); + + open_file("/sys/class/rtc/%s/since_epoch", wake_alarm_dir, + &epoch_file, O_RDONLY, "epoch"); + } + + return 0; +} + +static int lid_closed(void) +{ + char array[25]; + ssize_t size; + loff_t pos = 0; + + if (!lid_file) + return 0; + + size = vfs_read(lid_file, (char __user *) array, 25, &pos); + if ((int) size < 1) { + printk(KERN_INFO "Failed to read lid state file (%d).\n", + (int) size); + return 0; + } + + if (!strcmp(array, "state: closed\n")) + return 1; + + return 0; +} + +static void write_alarm_file(int value) +{ + ssize_t size; + char buf[40]; + loff_t pos = 0; + + if (!alarm_file) + return; + + sprintf(buf, "%d\n", value); + + size = vfs_write(alarm_file, (char __user *)buf, strlen(buf), &pos); + + if (size < 0) + printk(KERN_INFO "Error %d writing alarm value %s.\n", + (int) size, buf); +} + +/** + * toi_check_resleep: See whether to powerdown again after waking. + * + * After waking, check whether we should powerdown again in a (usually + * different) way. We only do this if the lid switch is still closed. + */ +void toi_check_resleep(void) +{ + /* We only return if we suspended to ram and woke. */ + if (lid_closed() && post_wake_state >= 0) + __toi_power_down(post_wake_state); +} + +void toi_power_down(void) +{ + if (alarm_file && wake_delay) { + char array[25]; + loff_t pos = 0; + size_t size = vfs_read(epoch_file, (char __user *) array, 25, + &pos); + + if (((int) size) < 1) + printk(KERN_INFO "Failed to read epoch file (%d).\n", + (int) size); + else { + unsigned long since_epoch; + if (!kstrtoul(array, 0, &since_epoch)) { + /* Clear any wakeup time. */ + write_alarm_file(0); + + /* Set new wakeup time. */ + write_alarm_file(since_epoch + wake_delay); + } + } + } + + __toi_power_down(toi_poweroff_method); + + toi_check_resleep(); +} + +static struct toi_sysfs_data sysfs_params[] = { +#if defined(CONFIG_ACPI) + SYSFS_STRING("lid_file", SYSFS_RW, lid_state_file, 256, 0, NULL), + SYSFS_INT("wake_delay", SYSFS_RW, &wake_delay, 0, INT_MAX, 0, NULL), + SYSFS_STRING("wake_alarm_dir", SYSFS_RW, wake_alarm_dir, 256, 0, NULL), + SYSFS_INT("post_wake_state", SYSFS_RW, &post_wake_state, -1, 5, 0, + NULL), + SYSFS_UL("powerdown_method", SYSFS_RW, &toi_poweroff_method, 0, 5, 0), + SYSFS_INT("did_suspend_to_both", SYSFS_READONLY, &did_suspend_to_both, + 0, 0, 0, NULL) +#endif +}; + +static struct toi_module_ops powerdown_ops = { + .type = MISC_HIDDEN_MODULE, + .name = "poweroff", + .initialise = powerdown_init, + .cleanup = powerdown_cleanup, + .directory = "[ROOT]", + .module = THIS_MODULE, + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +int toi_poweroff_init(void) +{ + return toi_register_module(&powerdown_ops); +} + +void toi_poweroff_exit(void) +{ + toi_unregister_module(&powerdown_ops); +} diff --git b/kernel/power/tuxonice_power_off.h b/kernel/power/tuxonice_power_off.h new file mode 100644 index 0000000..6e1d8bb --- /dev/null +++ b/kernel/power/tuxonice_power_off.h @@ -0,0 +1,24 @@ +/* + * kernel/power/tuxonice_power_off.h + * + * Copyright (C) 2006-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * Support for the powering down. + */ + +int toi_pm_state_finish(void); +void toi_power_down(void); +extern unsigned long toi_poweroff_method; +int toi_poweroff_init(void); +void toi_poweroff_exit(void); +void toi_check_resleep(void); + +extern int platform_begin(int platform_mode); +extern int platform_pre_snapshot(int platform_mode); +extern void platform_leave(int platform_mode); +extern void platform_end(int platform_mode); +extern void platform_finish(int platform_mode); +extern int platform_pre_restore(int platform_mode); +extern void platform_restore_cleanup(int platform_mode); diff --git b/kernel/power/tuxonice_prepare_image.c b/kernel/power/tuxonice_prepare_image.c new file mode 100644 index 0000000..a10d620 --- /dev/null +++ b/kernel/power/tuxonice_prepare_image.c @@ -0,0 +1,1080 @@ +/* + * kernel/power/tuxonice_prepare_image.c + * + * Copyright (C) 2003-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * We need to eat memory until we can: + * 1. Perform the save without changing anything (RAM_NEEDED < #pages) + * 2. Fit it all in available space (toiActiveAllocator->available_space() >= + * main_storage_needed()) + * 3. Reload the pagedir and pageset1 to places that don't collide with their + * final destinations, not knowing to what extent the resumed kernel will + * overlap with the one loaded at boot time. I think the resumed kernel + * should overlap completely, but I don't want to rely on this as it is + * an unproven assumption. We therefore assume there will be no overlap at + * all (worse case). + * 4. Meet the user's requested limit (if any) on the size of the image. + * The limit is in MB, so pages/256 (assuming 4K pages). + * + */ + +#include +#include +#include +#include +#include +#include + +#include "tuxonice_pageflags.h" +#include "tuxonice_modules.h" +#include "tuxonice_io.h" +#include "tuxonice_ui.h" +#include "tuxonice_prepare_image.h" +#include "tuxonice.h" +#include "tuxonice_extent.h" +#include "tuxonice_checksum.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_alloc.h" +#include "tuxonice_atomic_copy.h" +#include "tuxonice_builtin.h" + +static unsigned long num_nosave, main_storage_allocated, storage_limit, + header_storage_needed; +unsigned long extra_pd1_pages_allowance = + CONFIG_TOI_DEFAULT_EXTRA_PAGES_ALLOWANCE; +long image_size_limit = CONFIG_TOI_DEFAULT_IMAGE_SIZE_LIMIT; +static int no_ps2_needed; + +struct attention_list { + struct task_struct *task; + struct attention_list *next; +}; + +static struct attention_list *attention_list; + +#define PAGESET1 0 +#define PAGESET2 1 + +void free_attention_list(void) +{ + struct attention_list *last = NULL; + + while (attention_list) { + last = attention_list; + attention_list = attention_list->next; + toi_kfree(6, last, sizeof(*last)); + } +} + +static int build_attention_list(void) +{ + int i, task_count = 0; + struct task_struct *p; + struct attention_list *next; + + /* + * Count all userspace process (with task->mm) marked PF_NOFREEZE. + */ + toi_read_lock_tasklist(); + for_each_process(p) + if ((p->flags & PF_NOFREEZE) || p == current) + task_count++; + toi_read_unlock_tasklist(); + + /* + * Allocate attention list structs. + */ + for (i = 0; i < task_count; i++) { + struct attention_list *this = + toi_kzalloc(6, sizeof(struct attention_list), + TOI_WAIT_GFP); + if (!this) { + printk(KERN_INFO "Failed to allocate slab for " + "attention list.\n"); + free_attention_list(); + return 1; + } + this->next = NULL; + if (attention_list) + this->next = attention_list; + attention_list = this; + } + + next = attention_list; + toi_read_lock_tasklist(); + for_each_process(p) + if ((p->flags & PF_NOFREEZE) || p == current) { + next->task = p; + next = next->next; + } + toi_read_unlock_tasklist(); + return 0; +} + +static void pageset2_full(void) +{ + struct zone *zone; + struct page *page; + unsigned long flags; + int i; + + toi_trace_index++; + + for_each_populated_zone(zone) { + spin_lock_irqsave(&zone->lru_lock, flags); + for_each_lru(i) { + if (!zone_page_state(zone, NR_LRU_BASE + i)) + continue; + + list_for_each_entry(page, &zone->lruvec.lists[i], lru) { + struct address_space *mapping; + + mapping = page_mapping(page); + if (!mapping || !mapping->host || + !(mapping->host->i_flags & S_ATOMIC_COPY)) { + if (PageTOI_RO(page) && test_result_state(TOI_KEPT_IMAGE)) { + TOI_TRACE_DEBUG(page_to_pfn(page), "_Pageset2 unmodified."); + } else { + TOI_TRACE_DEBUG(page_to_pfn(page), "_Pageset2 pageset2_full."); + SetPagePageset2(page); + } + } + } + } + spin_unlock_irqrestore(&zone->lru_lock, flags); + } +} + +/* + * toi_mark_task_as_pageset + * Functionality : Marks all the saveable pages belonging to a given process + * as belonging to a particular pageset. + */ + +static void toi_mark_task_as_pageset(struct task_struct *t, int pageset2) +{ + struct vm_area_struct *vma; + struct mm_struct *mm; + + mm = t->active_mm; + + if (!mm || !mm->mmap) + return; + + toi_trace_index++; + + if (!irqs_disabled()) + down_read(&mm->mmap_sem); + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + unsigned long posn; + + if (!vma->vm_start || + vma->vm_flags & VM_PFNMAP) + continue; + + for (posn = vma->vm_start; posn < vma->vm_end; + posn += PAGE_SIZE) { + struct page *page = follow_page(vma, posn, 0); + struct address_space *mapping; + + if (!page || !pfn_valid(page_to_pfn(page))) + continue; + + mapping = page_mapping(page); + if (mapping && mapping->host && + mapping->host->i_flags & S_ATOMIC_COPY && pageset2) + continue; + + if (PageTOI_RO(page) && test_result_state(TOI_KEPT_IMAGE)) { + TOI_TRACE_DEBUG(page_to_pfn(page), "_Unmodified %d", pageset2 ? 1 : 2); + continue; + } + + if (pageset2) { + TOI_TRACE_DEBUG(page_to_pfn(page), "_MarkTaskAsPageset 1"); + SetPagePageset2(page); + } else { + TOI_TRACE_DEBUG(page_to_pfn(page), "_MarkTaskAsPageset 2"); + ClearPagePageset2(page); + SetPagePageset1(page); + } + } + } + + if (!irqs_disabled()) + up_read(&mm->mmap_sem); +} + +static void mark_tasks(int pageset) +{ + struct task_struct *p; + + toi_read_lock_tasklist(); + for_each_process(p) { + if (!p->mm) + continue; + + if (p->flags & PF_KTHREAD) + continue; + + toi_mark_task_as_pageset(p, pageset); + } + toi_read_unlock_tasklist(); + +} + +/* mark_pages_for_pageset2 + * + * Description: Mark unshared pages in processes not needed for hibernate as + * being able to be written out in a separate pagedir. + * HighMem pages are simply marked as pageset2. They won't be + * needed during hibernate. + */ + +static void toi_mark_pages_for_pageset2(void) +{ + struct attention_list *this = attention_list; + + memory_bm_clear(pageset2_map); + + if (test_action_state(TOI_NO_PAGESET2) || no_ps2_needed) + return; + + if (test_action_state(TOI_PAGESET2_FULL)) + pageset2_full(); + else + mark_tasks(PAGESET2); + + /* + * Because the tasks in attention_list are ones related to hibernating, + * we know that they won't go away under us. + */ + + while (this) { + if (!test_result_state(TOI_ABORTED)) + toi_mark_task_as_pageset(this->task, PAGESET1); + this = this->next; + } +} + +/* + * The atomic copy of pageset1 is stored in pageset2 pages. + * But if pageset1 is larger (normally only just after boot), + * we need to allocate extra pages to store the atomic copy. + * The following data struct and functions are used to handle + * the allocation and freeing of that memory. + */ + +static unsigned long extra_pages_allocated; + +struct extras { + struct page *page; + int order; + struct extras *next; +}; + +static struct extras *extras_list; + +/* toi_free_extra_pagedir_memory + * + * Description: Free previously allocated extra pagedir memory. + */ +void toi_free_extra_pagedir_memory(void) +{ + /* Free allocated pages */ + while (extras_list) { + struct extras *this = extras_list; + int i; + + extras_list = this->next; + + for (i = 0; i < (1 << this->order); i++) + ClearPageNosave(this->page + i); + + toi_free_pages(9, this->page, this->order); + toi_kfree(7, this, sizeof(*this)); + } + + extra_pages_allocated = 0; +} + +/* toi_allocate_extra_pagedir_memory + * + * Description: Allocate memory for making the atomic copy of pagedir1 in the + * case where it is bigger than pagedir2. + * Arguments: int num_to_alloc: Number of extra pages needed. + * Result: int. Number of extra pages we now have allocated. + */ +static int toi_allocate_extra_pagedir_memory(int extra_pages_needed) +{ + int j, order, num_to_alloc = extra_pages_needed - extra_pages_allocated; + gfp_t flags = TOI_ATOMIC_GFP; + + if (num_to_alloc < 1) + return 0; + + order = fls(num_to_alloc); + if (order >= MAX_ORDER) + order = MAX_ORDER - 1; + + while (num_to_alloc) { + struct page *newpage; + unsigned long virt; + struct extras *extras_entry; + + while ((1 << order) > num_to_alloc) + order--; + + extras_entry = (struct extras *) toi_kzalloc(7, + sizeof(struct extras), TOI_ATOMIC_GFP); + + if (!extras_entry) + return extra_pages_allocated; + + virt = toi_get_free_pages(9, flags, order); + while (!virt && order) { + order--; + virt = toi_get_free_pages(9, flags, order); + } + + if (!virt) { + toi_kfree(7, extras_entry, sizeof(*extras_entry)); + return extra_pages_allocated; + } + + newpage = virt_to_page(virt); + + extras_entry->page = newpage; + extras_entry->order = order; + extras_entry->next = extras_list; + + extras_list = extras_entry; + + for (j = 0; j < (1 << order); j++) { + SetPageNosave(newpage + j); + SetPagePageset1Copy(newpage + j); + } + + extra_pages_allocated += (1 << order); + num_to_alloc -= (1 << order); + } + + return extra_pages_allocated; +} + +/* + * real_nr_free_pages: Count pcp pages for a zone type or all zones + * (-1 for all, otherwise zone_idx() result desired). + */ +unsigned long real_nr_free_pages(unsigned long zone_idx_mask) +{ + struct zone *zone; + int result = 0, cpu; + + /* PCP lists */ + for_each_populated_zone(zone) { + if (!(zone_idx_mask & (1 << zone_idx(zone)))) + continue; + + for_each_online_cpu(cpu) { + struct per_cpu_pageset *pset = + per_cpu_ptr(zone->pageset, cpu); + struct per_cpu_pages *pcp = &pset->pcp; + result += pcp->count; + } + + result += zone_page_state(zone, NR_FREE_PAGES); + } + return result; +} + +/* + * Discover how much extra memory will be required by the drivers + * when they're asked to hibernate. We can then ensure that amount + * of memory is available when we really want it. + */ +static void get_extra_pd1_allowance(void) +{ + unsigned long orig_num_free = real_nr_free_pages(all_zones_mask), final; + + toi_prepare_status(CLEAR_BAR, "Finding allowance for drivers."); + + if (toi_go_atomic(PMSG_FREEZE, 1)) + return; + + final = real_nr_free_pages(all_zones_mask); + toi_end_atomic(ATOMIC_ALL_STEPS, 1, 0); + + extra_pd1_pages_allowance = (orig_num_free > final) ? + orig_num_free - final + MIN_EXTRA_PAGES_ALLOWANCE : + MIN_EXTRA_PAGES_ALLOWANCE; +} + +/* + * Amount of storage needed, possibly taking into account the + * expected compression ratio and possibly also ignoring our + * allowance for extra pages. + */ +static unsigned long main_storage_needed(int use_ecr, + int ignore_extra_pd1_allow) +{ + return (pagedir1.size + pagedir2.size + + (ignore_extra_pd1_allow ? 0 : extra_pd1_pages_allowance)) * + (use_ecr ? toi_expected_compression_ratio() : 100) / 100; +} + +/* + * Storage needed for the image header, in bytes until the return. + */ +unsigned long get_header_storage_needed(void) +{ + unsigned long bytes = sizeof(struct toi_header) + + toi_header_storage_for_modules() + + toi_pageflags_space_needed() + + fs_info_space_needed(); + + return DIV_ROUND_UP(bytes, PAGE_SIZE); +} + +/* + * When freeing memory, pages from either pageset might be freed. + * + * When seeking to free memory to be able to hibernate, for every ps1 page + * freed, we need 2 less pages for the atomic copy because there is one less + * page to copy and one more page into which data can be copied. + * + * Freeing ps2 pages saves us nothing directly. No more memory is available + * for the atomic copy. Indirectly, a ps1 page might be freed (slab?), but + * that's too much work to figure out. + * + * => ps1_to_free functions + * + * Of course if we just want to reduce the image size, because of storage + * limitations or an image size limit either ps will do. + * + * => any_to_free function + */ + +static unsigned long lowpages_usable_for_highmem_copy(void) +{ + unsigned long needed = get_lowmem_size(pagedir1) + + extra_pd1_pages_allowance + MIN_FREE_RAM + + toi_memory_for_modules(0), + available = get_lowmem_size(pagedir2) + + real_nr_free_low_pages() + extra_pages_allocated; + + return available > needed ? available - needed : 0; +} + +static unsigned long highpages_ps1_to_free(void) +{ + unsigned long need = get_highmem_size(pagedir1), + available = get_highmem_size(pagedir2) + + real_nr_free_high_pages() + + lowpages_usable_for_highmem_copy(); + + return need > available ? DIV_ROUND_UP(need - available, 2) : 0; +} + +static unsigned long lowpages_ps1_to_free(void) +{ + unsigned long needed = get_lowmem_size(pagedir1) + + extra_pd1_pages_allowance + MIN_FREE_RAM + + toi_memory_for_modules(0), + available = get_lowmem_size(pagedir2) + + real_nr_free_low_pages() + extra_pages_allocated; + + return needed > available ? DIV_ROUND_UP(needed - available, 2) : 0; +} + +static unsigned long current_image_size(void) +{ + return pagedir1.size + pagedir2.size + header_storage_needed; +} + +static unsigned long storage_still_required(void) +{ + unsigned long needed = main_storage_needed(1, 1); + return needed > storage_limit ? needed - storage_limit : 0; +} + +static unsigned long ram_still_required(void) +{ + unsigned long needed = MIN_FREE_RAM + toi_memory_for_modules(0) + + 2 * extra_pd1_pages_allowance, + available = real_nr_free_low_pages() + extra_pages_allocated; + return needed > available ? needed - available : 0; +} + +unsigned long any_to_free(int use_image_size_limit) +{ + int use_soft_limit = use_image_size_limit && image_size_limit > 0; + unsigned long current_size = current_image_size(), + soft_limit = use_soft_limit ? (image_size_limit << 8) : 0, + to_free = use_soft_limit ? (current_size > soft_limit ? + current_size - soft_limit : 0) : 0, + storage_limit = storage_still_required(), + ram_limit = ram_still_required(), + first_max = max(to_free, storage_limit); + + return max(first_max, ram_limit); +} + +static int need_pageset2(void) +{ + return (real_nr_free_low_pages() + extra_pages_allocated - + 2 * extra_pd1_pages_allowance - MIN_FREE_RAM - + toi_memory_for_modules(0) - pagedir1.size) < pagedir2.size; +} + +/* amount_needed + * + * Calculates the amount by which the image size needs to be reduced to meet + * our constraints. + */ +static unsigned long amount_needed(int use_image_size_limit) +{ + return max(highpages_ps1_to_free() + lowpages_ps1_to_free(), + any_to_free(use_image_size_limit)); +} + +static int image_not_ready(int use_image_size_limit) +{ + toi_message(TOI_EAT_MEMORY, TOI_LOW, 1, + "Amount still needed (%lu) > 0:%u," + " Storage allocd: %lu < %lu: %u.\n", + amount_needed(use_image_size_limit), + (amount_needed(use_image_size_limit) > 0), + main_storage_allocated, + main_storage_needed(1, 1), + main_storage_allocated < main_storage_needed(1, 1)); + + toi_cond_pause(0, NULL); + + return (amount_needed(use_image_size_limit) > 0) || + main_storage_allocated < main_storage_needed(1, 1); +} + +static void display_failure_reason(int tries_exceeded) +{ + unsigned long storage_required = storage_still_required(), + ram_required = ram_still_required(), + high_ps1 = highpages_ps1_to_free(), + low_ps1 = lowpages_ps1_to_free(); + + printk(KERN_INFO "Failed to prepare the image because...\n"); + + if (!storage_limit) { + printk(KERN_INFO "- You need some storage available to be " + "able to hibernate.\n"); + return; + } + + if (tries_exceeded) + printk(KERN_INFO "- The maximum number of iterations was " + "reached without successfully preparing the " + "image.\n"); + + if (storage_required) { + printk(KERN_INFO " - We need at least %lu pages of storage " + "(ignoring the header), but only have %lu.\n", + main_storage_needed(1, 1), + main_storage_allocated); + set_abort_result(TOI_INSUFFICIENT_STORAGE); + } + + if (ram_required) { + printk(KERN_INFO " - We need %lu more free pages of low " + "memory.\n", ram_required); + printk(KERN_INFO " Minimum free : %8d\n", MIN_FREE_RAM); + printk(KERN_INFO " + Reqd. by modules : %8lu\n", + toi_memory_for_modules(0)); + printk(KERN_INFO " + 2 * extra allow : %8lu\n", + 2 * extra_pd1_pages_allowance); + printk(KERN_INFO " - Currently free : %8lu\n", + real_nr_free_low_pages()); + printk(KERN_INFO " - Pages allocd : %8lu\n", + extra_pages_allocated); + printk(KERN_INFO " : ========\n"); + printk(KERN_INFO " Still needed : %8lu\n", + ram_required); + + /* Print breakdown of memory needed for modules */ + toi_memory_for_modules(1); + set_abort_result(TOI_UNABLE_TO_FREE_ENOUGH_MEMORY); + } + + if (high_ps1) { + printk(KERN_INFO "- We need to free %lu highmem pageset 1 " + "pages.\n", high_ps1); + set_abort_result(TOI_UNABLE_TO_FREE_ENOUGH_MEMORY); + } + + if (low_ps1) { + printk(KERN_INFO " - We need to free %ld lowmem pageset 1 " + "pages.\n", low_ps1); + set_abort_result(TOI_UNABLE_TO_FREE_ENOUGH_MEMORY); + } +} + +static void display_stats(int always, int sub_extra_pd1_allow) +{ + char buffer[255]; + snprintf(buffer, 254, + "Free:%lu(%lu). Sets:%lu(%lu),%lu(%lu). " + "Nosave:%lu-%lu=%lu. Storage:%lu/%lu(%lu=>%lu). " + "Needed:%lu,%lu,%lu(%u,%lu,%lu,%ld) (PS2:%s)\n", + + /* Free */ + real_nr_free_pages(all_zones_mask), + real_nr_free_low_pages(), + + /* Sets */ + pagedir1.size, pagedir1.size - get_highmem_size(pagedir1), + pagedir2.size, pagedir2.size - get_highmem_size(pagedir2), + + /* Nosave */ + num_nosave, extra_pages_allocated, + num_nosave - extra_pages_allocated, + + /* Storage */ + main_storage_allocated, + storage_limit, + main_storage_needed(1, sub_extra_pd1_allow), + main_storage_needed(1, 1), + + /* Needed */ + lowpages_ps1_to_free(), highpages_ps1_to_free(), + any_to_free(1), + MIN_FREE_RAM, toi_memory_for_modules(0), + extra_pd1_pages_allowance, + image_size_limit, + + need_pageset2() ? "yes" : "no"); + + if (always) + printk("%s", buffer); + else + toi_message(TOI_EAT_MEMORY, TOI_MEDIUM, 1, buffer); +} + +/* flag_image_pages + * + * This routine generates our lists of pages to be stored in each + * pageset. Since we store the data using extents, and adding new + * extents might allocate a new extent page, this routine may well + * be called more than once. + */ +static void flag_image_pages(int atomic_copy) +{ + int num_free = 0, num_unmodified = 0; + unsigned long loop; + struct zone *zone; + + pagedir1.size = 0; + pagedir2.size = 0; + + set_highmem_size(pagedir1, 0); + set_highmem_size(pagedir2, 0); + + num_nosave = 0; + toi_trace_index++; + + memory_bm_clear(pageset1_map); + + toi_generate_free_page_map(); + + /* + * Pages not to be saved are marked Nosave irrespective of being + * reserved. + */ + for_each_populated_zone(zone) { + int highmem = is_highmem(zone); + + for (loop = 0; loop < zone->spanned_pages; loop++) { + unsigned long pfn = zone->zone_start_pfn + loop; + struct page *page; + int chunk_size; + + if (!pfn_valid(pfn)) { + TOI_TRACE_DEBUG(pfn, "_Flag Invalid"); + continue; + } + + chunk_size = toi_size_of_free_region(zone, pfn); + if (chunk_size) { + unsigned long y; + for (y = pfn; y < pfn + chunk_size; y++) { + page = pfn_to_page(y); + TOI_TRACE_DEBUG(y, "_Flag Free"); + ClearPagePageset1(page); + ClearPagePageset2(page); + } + num_free += chunk_size; + loop += chunk_size - 1; + continue; + } + + page = pfn_to_page(pfn); + + if (PageNosave(page)) { + char *desc = PagePageset1Copy(page) ? "Pageset1Copy" : "NoSave"; + TOI_TRACE_DEBUG(pfn, "_Flag %s", desc); + num_nosave++; + continue; + } + + page = highmem ? saveable_highmem_page(zone, pfn) : + saveable_page(zone, pfn); + + if (!page) { + TOI_TRACE_DEBUG(pfn, "_Flag Nosave2"); + num_nosave++; + continue; + } + + if (PageTOI_RO(page) && test_result_state(TOI_KEPT_IMAGE)) { + TOI_TRACE_DEBUG(pfn, "_Unmodified"); + num_unmodified++; + continue; + } + + if (PagePageset2(page)) { + pagedir2.size++; + TOI_TRACE_DEBUG(pfn, "_Flag PS2"); + if (PageHighMem(page)) + inc_highmem_size(pagedir2); + else + SetPagePageset1Copy(page); + if (PageResave(page)) { + SetPagePageset1(page); + ClearPagePageset1Copy(page); + pagedir1.size++; + if (PageHighMem(page)) + inc_highmem_size(pagedir1); + } + } else { + pagedir1.size++; + TOI_TRACE_DEBUG(pfn, "_Flag PS1"); + SetPagePageset1(page); + if (PageHighMem(page)) + inc_highmem_size(pagedir1); + } + } + } + + if (!atomic_copy) + toi_message(TOI_EAT_MEMORY, TOI_MEDIUM, 0, + "Count data pages: Set1 (%d) + Set2 (%d) + Nosave (%ld)" + " + Unmodified (%d) + NumFree (%d) = %d.\n", + pagedir1.size, pagedir2.size, num_nosave, num_unmodified, + num_free, pagedir1.size + pagedir2.size + num_nosave + num_free); +} + +void toi_recalculate_image_contents(int atomic_copy) +{ + memory_bm_clear(pageset1_map); + if (!atomic_copy) { + unsigned long pfn; + memory_bm_position_reset(pageset2_map); + for (pfn = memory_bm_next_pfn(pageset2_map, 0); + pfn != BM_END_OF_MAP; + pfn = memory_bm_next_pfn(pageset2_map, 0)) + ClearPagePageset1Copy(pfn_to_page(pfn)); + /* Need to call this before getting pageset1_size! */ + toi_mark_pages_for_pageset2(); + } + memory_bm_position_reset(pageset2_map); + flag_image_pages(atomic_copy); + + if (!atomic_copy) { + storage_limit = toiActiveAllocator->storage_available(); + display_stats(0, 0); + } +} + +int try_allocate_extra_memory(void) +{ + unsigned long wanted = pagedir1.size + extra_pd1_pages_allowance - + get_lowmem_size(pagedir2); + if (wanted > extra_pages_allocated) { + unsigned long got = toi_allocate_extra_pagedir_memory(wanted); + if (wanted < got) { + toi_message(TOI_EAT_MEMORY, TOI_LOW, 1, + "Want %d extra pages for pageset1, got %d.\n", + wanted, got); + return 1; + } + } + return 0; +} + +/* update_image + * + * Allocate [more] memory and storage for the image. + */ +static void update_image(int ps2_recalc) +{ + int old_header_req; + unsigned long seek; + + if (try_allocate_extra_memory()) + return; + + if (ps2_recalc) + goto recalc; + + thaw_kernel_threads(); + + /* + * Allocate remaining storage space, if possible, up to the + * maximum we know we'll need. It's okay to allocate the + * maximum if the writer is the swapwriter, but + * we don't want to grab all available space on an NFS share. + * We therefore ignore the expected compression ratio here, + * thereby trying to allocate the maximum image size we could + * need (assuming compression doesn't expand the image), but + * don't complain if we can't get the full amount we're after. + */ + + do { + int result; + + old_header_req = header_storage_needed; + toiActiveAllocator->reserve_header_space(header_storage_needed); + + /* How much storage is free with the reservation applied? */ + storage_limit = toiActiveAllocator->storage_available(); + seek = min(storage_limit, main_storage_needed(0, 0)); + + result = toiActiveAllocator->allocate_storage(seek); + if (result) + printk("Failed to allocate storage (%d).\n", result); + + main_storage_allocated = + toiActiveAllocator->storage_allocated(); + + /* Need more header because more storage allocated? */ + header_storage_needed = get_header_storage_needed(); + + } while (header_storage_needed > old_header_req); + + if (freeze_kernel_threads()) + set_abort_result(TOI_FREEZING_FAILED); + +recalc: + toi_recalculate_image_contents(0); +} + +/* attempt_to_freeze + * + * Try to freeze processes. + */ + +static int attempt_to_freeze(void) +{ + int result; + + /* Stop processes before checking again */ + toi_prepare_status(CLEAR_BAR, "Freezing processes & syncing " + "filesystems."); + result = freeze_processes(); + + if (result) + set_abort_result(TOI_FREEZING_FAILED); + + result = freeze_kernel_threads(); + + if (result) + set_abort_result(TOI_FREEZING_FAILED); + + return result; +} + +/* eat_memory + * + * Try to free some memory, either to meet hard or soft constraints on the image + * characteristics. + * + * Hard constraints: + * - Pageset1 must be < half of memory; + * - We must have enough memory free at resume time to have pageset1 + * be able to be loaded in pages that don't conflict with where it has to + * be restored. + * Soft constraints + * - User specificied image size limit. + */ +static void eat_memory(void) +{ + unsigned long amount_wanted = 0; + int did_eat_memory = 0; + + /* + * Note that if we have enough storage space and enough free memory, we + * may exit without eating anything. We give up when the last 10 + * iterations ate no extra pages because we're not going to get much + * more anyway, but the few pages we get will take a lot of time. + * + * We freeze processes before beginning, and then unfreeze them if we + * need to eat memory until we think we have enough. If our attempts + * to freeze fail, we give up and abort. + */ + + amount_wanted = amount_needed(1); + + switch (image_size_limit) { + case -1: /* Don't eat any memory */ + if (amount_wanted > 0) { + set_abort_result(TOI_WOULD_EAT_MEMORY); + return; + } + break; + case -2: /* Free caches only */ + drop_pagecache(); + toi_recalculate_image_contents(0); + amount_wanted = amount_needed(1); + break; + default: + break; + } + + if (amount_wanted > 0 && !test_result_state(TOI_ABORTED) && + image_size_limit != -1) { + unsigned long request = amount_wanted; + unsigned long high_req = max(highpages_ps1_to_free(), + any_to_free(1)); + unsigned long low_req = lowpages_ps1_to_free(); + unsigned long got = 0; + + toi_prepare_status(CLEAR_BAR, + "Seeking to free %ldMB of memory.", + MB(amount_wanted)); + + thaw_kernel_threads(); + + /* + * Ask for too many because shrink_memory_mask doesn't + * currently return enough most of the time. + */ + + if (low_req) + got = shrink_memory_mask(low_req, GFP_KERNEL); + if (high_req) + shrink_memory_mask(high_req - got, GFP_HIGHUSER); + + did_eat_memory = 1; + + toi_recalculate_image_contents(0); + + amount_wanted = amount_needed(1); + + printk(KERN_DEBUG "Asked shrink_memory_mask for %ld low pages &" + " %ld pages from anywhere, got %ld.\n", + high_req, low_req, + request - amount_wanted); + + toi_cond_pause(0, NULL); + + if (freeze_kernel_threads()) + set_abort_result(TOI_FREEZING_FAILED); + } + + if (did_eat_memory) + toi_recalculate_image_contents(0); +} + +/* toi_prepare_image + * + * Entry point to the whole image preparation section. + * + * We do four things: + * - Freeze processes; + * - Ensure image size constraints are met; + * - Complete all the preparation for saving the image, + * including allocation of storage. The only memory + * that should be needed when we're finished is that + * for actually storing the image (and we know how + * much is needed for that because the modules tell + * us). + * - Make sure that all dirty buffers are written out. + */ +#define MAX_TRIES 2 +int toi_prepare_image(void) +{ + int result = 1, tries = 1; + + main_storage_allocated = 0; + no_ps2_needed = 0; + + if (attempt_to_freeze()) + return 1; + + lock_device_hotplug(); + set_toi_state(TOI_DEVICE_HOTPLUG_LOCKED); + + if (!extra_pd1_pages_allowance) + get_extra_pd1_allowance(); + + storage_limit = toiActiveAllocator->storage_available(); + + if (!storage_limit) { + printk(KERN_INFO "No storage available. Didn't try to prepare " + "an image.\n"); + display_failure_reason(0); + set_abort_result(TOI_NOSTORAGE_AVAILABLE); + return 1; + } + + if (build_attention_list()) { + abort_hibernate(TOI_UNABLE_TO_PREPARE_IMAGE, + "Unable to successfully prepare the image.\n"); + return 1; + } + + toi_recalculate_image_contents(0); + + do { + toi_prepare_status(CLEAR_BAR, + "Preparing Image. Try %d.", tries); + + eat_memory(); + + if (test_result_state(TOI_ABORTED)) + break; + + update_image(0); + + tries++; + + } while (image_not_ready(1) && tries <= MAX_TRIES && + !test_result_state(TOI_ABORTED)); + + result = image_not_ready(0); + + /* TODO: Handle case where need to remove existing image and resave + * instead of adding to incremental image. */ + + if (!test_result_state(TOI_ABORTED)) { + if (result) { + display_stats(1, 0); + display_failure_reason(tries > MAX_TRIES); + abort_hibernate(TOI_UNABLE_TO_PREPARE_IMAGE, + "Unable to successfully prepare the image.\n"); + } else { + /* Pageset 2 needed? */ + if (!need_pageset2() && + test_action_state(TOI_NO_PS2_IF_UNNEEDED)) { + no_ps2_needed = 1; + toi_recalculate_image_contents(0); + update_image(1); + } + + toi_cond_pause(1, "Image preparation complete."); + } + } + + return result ? result : allocate_checksum_pages(); +} diff --git b/kernel/power/tuxonice_prepare_image.h b/kernel/power/tuxonice_prepare_image.h new file mode 100644 index 0000000..c150897 --- /dev/null +++ b/kernel/power/tuxonice_prepare_image.h @@ -0,0 +1,38 @@ +/* + * kernel/power/tuxonice_prepare_image.h + * + * Copyright (C) 2003-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + */ + +#include + +extern int toi_prepare_image(void); +extern void toi_recalculate_image_contents(int storage_available); +extern unsigned long real_nr_free_pages(unsigned long zone_idx_mask); +extern long image_size_limit; +extern void toi_free_extra_pagedir_memory(void); +extern unsigned long extra_pd1_pages_allowance; +extern void free_attention_list(void); + +#define MIN_FREE_RAM 100 +#define MIN_EXTRA_PAGES_ALLOWANCE 500 + +#define all_zones_mask ((unsigned long) ((1 << MAX_NR_ZONES) - 1)) +#ifdef CONFIG_HIGHMEM +#define real_nr_free_high_pages() (real_nr_free_pages(1 << ZONE_HIGHMEM)) +#define real_nr_free_low_pages() (real_nr_free_pages(all_zones_mask - \ + (1 << ZONE_HIGHMEM))) +#else +#define real_nr_free_high_pages() (0) +#define real_nr_free_low_pages() (real_nr_free_pages(all_zones_mask)) + +/* For eat_memory function */ +#define ZONE_HIGHMEM (MAX_NR_ZONES + 1) +#endif + +unsigned long get_header_storage_needed(void); +unsigned long any_to_free(int use_image_size_limit); +int try_allocate_extra_memory(void); diff --git b/kernel/power/tuxonice_prune.c b/kernel/power/tuxonice_prune.c new file mode 100644 index 0000000..5bc56d3 --- /dev/null +++ b/kernel/power/tuxonice_prune.c @@ -0,0 +1,406 @@ +/* + * kernel/power/tuxonice_prune.c + * + * Copyright (C) 2012 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * This file implements a TuxOnIce module that seeks to prune the + * amount of data written to disk. It builds a table of hashes + * of the uncompressed data, and writes the pfn of the previous page + * with the same contents instead of repeating the data when a match + * is found. + */ + +#include +#include +#include +#include +#include +#include + +#include "tuxonice_builtin.h" +#include "tuxonice.h" +#include "tuxonice_modules.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_io.h" +#include "tuxonice_ui.h" +#include "tuxonice_alloc.h" + +/* + * We never write a page bigger than PAGE_SIZE, so use a large number + * to indicate that data is a PFN. + */ +#define PRUNE_DATA_IS_PFN (PAGE_SIZE + 100) + +static unsigned long toi_pruned_pages; + +static struct toi_module_ops toi_prune_ops; +static struct toi_module_ops *next_driver; + +static char toi_prune_hash_algo_name[32] = "sha1"; + +static DEFINE_MUTEX(stats_lock); + +struct cpu_context { + struct shash_desc desc; + char *digest; +}; + +#define OUT_BUF_SIZE (2 * PAGE_SIZE) + +static DEFINE_PER_CPU(struct cpu_context, contexts); + +/* + * toi_crypto_prepare + * + * Prepare to do some work by allocating buffers and transforms. + */ +static int toi_prune_crypto_prepare(void) +{ + int cpu, ret, digestsize; + + if (!*toi_prune_hash_algo_name) { + printk(KERN_INFO "TuxOnIce: Pruning enabled but no " + "hash algorithm set.\n"); + return 1; + } + + for_each_online_cpu(cpu) { + struct cpu_context *this = &per_cpu(contexts, cpu); + this->desc.tfm = crypto_alloc_shash(toi_prune_hash_algo_name, 0, 0); + if (IS_ERR(this->desc.tfm)) { + printk(KERN_INFO "TuxOnIce: Failed to allocate the " + "%s prune hash algorithm.\n", + toi_prune_hash_algo_name); + this->desc.tfm = NULL; + return 1; + } + + if (!digestsize) + digestsize = crypto_shash_digestsize(this->desc.tfm); + + this->digest = kmalloc(digestsize, GFP_KERNEL); + if (!this->digest) { + printk(KERN_INFO "TuxOnIce: Failed to allocate space " + "for digest output.\n"); + crypto_free_shash(this->desc.tfm); + this->desc.tfm = NULL; + } + + this->desc.flags = 0; + + ret = crypto_shash_init(&this->desc); + if (ret < 0) { + printk(KERN_INFO "TuxOnIce: Failed to initialise the " + "%s prune hash algorithm.\n", + toi_prune_hash_algo_name); + kfree(this->digest); + this->digest = NULL; + crypto_free_shash(this->desc.tfm); + this->desc.tfm = NULL; + return 1; + } + } + + return 0; +} + +static int toi_prune_rw_cleanup(int writing) +{ + int cpu; + + for_each_online_cpu(cpu) { + struct cpu_context *this = &per_cpu(contexts, cpu); + if (this->desc.tfm) { + crypto_free_shash(this->desc.tfm); + this->desc.tfm = NULL; + } + + if (this->digest) { + kfree(this->digest); + this->digest = NULL; + } + } + + return 0; +} + +/* + * toi_prune_init + */ + +static int toi_prune_init(int toi_or_resume) +{ + if (!toi_or_resume) + return 0; + + toi_pruned_pages = 0; + + next_driver = toi_get_next_filter(&toi_prune_ops); + + return next_driver ? 0 : -ECHILD; +} + +/* + * toi_prune_rw_init() + */ + +static int toi_prune_rw_init(int rw, int stream_number) +{ + if (toi_prune_crypto_prepare()) { + printk(KERN_ERR "Failed to initialise prune " + "algorithm.\n"); + if (rw == READ) { + printk(KERN_INFO "Unable to read the image.\n"); + return -ENODEV; + } else { + printk(KERN_INFO "Continuing without " + "pruning the image.\n"); + toi_prune_ops.enabled = 0; + } + } + + return 0; +} + +/* + * toi_prune_write_page() + * + * Compress a page of data, buffering output and passing on filled + * pages to the next module in the pipeline. + * + * Buffer_page: Pointer to a buffer of size PAGE_SIZE, containing + * data to be checked. + * + * Returns: 0 on success. Otherwise the error is that returned by later + * modules, -ECHILD if we have a broken pipeline or -EIO if + * zlib errs. + */ +static int toi_prune_write_page(unsigned long index, int buf_type, + void *buffer_page, unsigned int buf_size) +{ + int ret = 0, cpu = smp_processor_id(), write_data = 1; + struct cpu_context *ctx = &per_cpu(contexts, cpu); + u8* output_buffer = buffer_page; + int output_len = buf_size; + int out_buf_type = buf_type; + void *buffer_start; + u32 buf[4]; + + if (ctx->desc.tfm) { + + buffer_start = TOI_MAP(buf_type, buffer_page); + ctx->len = OUT_BUF_SIZE; + + ret = crypto_shash_digest(&ctx->desc, buffer_start, buf_size, &ctx->digest); + if (ret) { + printk(KERN_INFO "TuxOnIce: Failed to calculate digest (%d).\n", ret); + } else { + mutex_lock(&stats_lock); + + toi_pruned_pages++; + + mutex_unlock(&stats_lock); + + } + + TOI_UNMAP(buf_type, buffer_page); + } + + if (write_data) + ret = next_driver->write_page(index, out_buf_type, + output_buffer, output_len); + else + ret = next_driver->write_page(index, out_buf_type, + output_buffer, output_len); + + return ret; +} + +/* + * toi_prune_read_page() + * @buffer_page: struct page *. Pointer to a buffer of size PAGE_SIZE. + * + * Retrieve data from later modules or from a previously loaded page and + * fill the input buffer. + * Zero if successful. Error condition from me or from downstream on failure. + */ +static int toi_prune_read_page(unsigned long *index, int buf_type, + void *buffer_page, unsigned int *buf_size) +{ + int ret, cpu = smp_processor_id(); + unsigned int len; + char *buffer_start; + struct cpu_context *ctx = &per_cpu(contexts, cpu); + + if (!ctx->desc.tfm) + return next_driver->read_page(index, TOI_PAGE, buffer_page, + buf_size); + + /* + * All our reads must be synchronous - we can't handle + * data that hasn't been read yet. + */ + + ret = next_driver->read_page(index, buf_type, buffer_page, &len); + + if (len == PRUNE_DATA_IS_PFN) { + buffer_start = kmap(buffer_page); + } + + return ret; +} + +/* + * toi_prune_print_debug_stats + * @buffer: Pointer to a buffer into which the debug info will be printed. + * @size: Size of the buffer. + * + * Print information to be recorded for debugging purposes into a buffer. + * Returns: Number of characters written to the buffer. + */ + +static int toi_prune_print_debug_stats(char *buffer, int size) +{ + int len; + + /* Output the number of pages pruned. */ + if (*toi_prune_hash_algo_name) + len = scnprintf(buffer, size, "- Compressor is '%s'.\n", + toi_prune_hash_algo_name); + else + len = scnprintf(buffer, size, "- Compressor is not set.\n"); + + if (toi_pruned_pages) + len += scnprintf(buffer+len, size - len, " Pruned " + "%lu pages).\n", + toi_pruned_pages); + return len; +} + +/* + * toi_prune_memory_needed + * + * Tell the caller how much memory we need to operate during hibernate/resume. + * Returns: Unsigned long. Maximum number of bytes of memory required for + * operation. + */ +static int toi_prune_memory_needed(void) +{ + return 2 * PAGE_SIZE; +} + +static int toi_prune_storage_needed(void) +{ + return 2 * sizeof(unsigned long) + 2 * sizeof(int) + + strlen(toi_prune_hash_algo_name) + 1; +} + +/* + * toi_prune_save_config_info + * @buffer: Pointer to a buffer of size PAGE_SIZE. + * + * Save informaton needed when reloading the image at resume time. + * Returns: Number of bytes used for saving our data. + */ +static int toi_prune_save_config_info(char *buffer) +{ + int len = strlen(toi_prune_hash_algo_name) + 1, offset = 0; + + *((unsigned long *) buffer) = toi_pruned_pages; + offset += sizeof(unsigned long); + *((int *) (buffer + offset)) = len; + offset += sizeof(int); + strncpy(buffer + offset, toi_prune_hash_algo_name, len); + return offset + len; +} + +/* toi_prune_load_config_info + * @buffer: Pointer to the start of the data. + * @size: Number of bytes that were saved. + * + * Description: Reload information needed for passing back to the + * resumed kernel. + */ +static void toi_prune_load_config_info(char *buffer, int size) +{ + int len, offset = 0; + + toi_pruned_pages = *((unsigned long *) buffer); + offset += sizeof(unsigned long); + len = *((int *) (buffer + offset)); + offset += sizeof(int); + strncpy(toi_prune_hash_algo_name, buffer + offset, len); +} + +static void toi_prune_pre_atomic_restore(struct toi_boot_kernel_data *bkd) +{ + bkd->pruned_pages = toi_pruned_pages; +} + +static void toi_prune_post_atomic_restore(struct toi_boot_kernel_data *bkd) +{ + toi_pruned_pages = bkd->pruned_pages; +} + +/* + * toi_expected_ratio + * + * Description: Returns the expected ratio between data passed into this module + * and the amount of data output when writing. + * Returns: 100 - we have no idea how many pages will be pruned. + */ + +static int toi_prune_expected_ratio(void) +{ + return 100; +} + +/* + * data for our sysfs entries. + */ +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_INT("enabled", SYSFS_RW, &toi_prune_ops.enabled, 0, 1, 0, + NULL), + SYSFS_STRING("algorithm", SYSFS_RW, toi_prune_hash_algo_name, 31, 0, NULL), +}; + +/* + * Ops structure. + */ +static struct toi_module_ops toi_prune_ops = { + .type = FILTER_MODULE, + .name = "prune", + .directory = "prune", + .module = THIS_MODULE, + .initialise = toi_prune_init, + .memory_needed = toi_prune_memory_needed, + .print_debug_info = toi_prune_print_debug_stats, + .save_config_info = toi_prune_save_config_info, + .load_config_info = toi_prune_load_config_info, + .storage_needed = toi_prune_storage_needed, + .expected_compression = toi_prune_expected_ratio, + + .pre_atomic_restore = toi_prune_pre_atomic_restore, + .post_atomic_restore = toi_prune_post_atomic_restore, + + .rw_init = toi_prune_rw_init, + .rw_cleanup = toi_prune_rw_cleanup, + + .write_page = toi_prune_write_page, + .read_page = toi_prune_read_page, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +/* ---- Registration ---- */ + +static __init int toi_prune_load(void) +{ + return toi_register_module(&toi_prune_ops); +} + +late_initcall(toi_prune_load); diff --git b/kernel/power/tuxonice_storage.c b/kernel/power/tuxonice_storage.c new file mode 100644 index 0000000..d8539c2 --- /dev/null +++ b/kernel/power/tuxonice_storage.c @@ -0,0 +1,282 @@ +/* + * kernel/power/tuxonice_storage.c + * + * Copyright (C) 2005-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * Routines for talking to a userspace program that manages storage. + * + * The kernel side: + * - starts the userspace program; + * - sends messages telling it when to open and close the connection; + * - tells it when to quit; + * + * The user space side: + * - passes messages regarding status; + * + */ + +#include +#include + +#include "tuxonice_sysfs.h" +#include "tuxonice_modules.h" +#include "tuxonice_netlink.h" +#include "tuxonice_storage.h" +#include "tuxonice_ui.h" + +static struct user_helper_data usm_helper_data; +static struct toi_module_ops usm_ops; +static int message_received, usm_prepare_count; +static int storage_manager_last_action, storage_manager_action; + +static int usm_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) +{ + int type; + int *data; + + type = nlh->nlmsg_type; + + /* A control message: ignore them */ + if (type < NETLINK_MSG_BASE) + return 0; + + /* Unknown message: reply with EINVAL */ + if (type >= USM_MSG_MAX) + return -EINVAL; + + /* All operations require privileges, even GET */ + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + /* Only allow one task to receive NOFREEZE privileges */ + if (type == NETLINK_MSG_NOFREEZE_ME && usm_helper_data.pid != -1) + return -EBUSY; + + data = (int *) NLMSG_DATA(nlh); + + switch (type) { + case USM_MSG_SUCCESS: + case USM_MSG_FAILED: + message_received = type; + complete(&usm_helper_data.wait_for_process); + break; + default: + printk(KERN_INFO "Storage manager doesn't recognise " + "message %d.\n", type); + } + + return 1; +} + +#ifdef CONFIG_NET +static int activations; + +int toi_activate_storage(int force) +{ + int tries = 1; + + if (usm_helper_data.pid == -1 || !usm_ops.enabled) + return 0; + + message_received = 0; + activations++; + + if (activations > 1 && !force) + return 0; + + while ((!message_received || message_received == USM_MSG_FAILED) && + tries < 2) { + toi_prepare_status(DONT_CLEAR_BAR, "Activate storage attempt " + "%d.\n", tries); + + init_completion(&usm_helper_data.wait_for_process); + + toi_send_netlink_message(&usm_helper_data, + USM_MSG_CONNECT, + NULL, 0); + + /* Wait 2 seconds for the userspace process to make contact */ + wait_for_completion_timeout(&usm_helper_data.wait_for_process, + 2*HZ); + + tries++; + } + + return 0; +} + +int toi_deactivate_storage(int force) +{ + if (usm_helper_data.pid == -1 || !usm_ops.enabled) + return 0; + + message_received = 0; + activations--; + + if (activations && !force) + return 0; + + init_completion(&usm_helper_data.wait_for_process); + + toi_send_netlink_message(&usm_helper_data, + USM_MSG_DISCONNECT, + NULL, 0); + + wait_for_completion_timeout(&usm_helper_data.wait_for_process, 2*HZ); + + if (!message_received || message_received == USM_MSG_FAILED) { + printk(KERN_INFO "Returning failure disconnecting storage.\n"); + return 1; + } + + return 0; +} +#endif + +static void storage_manager_simulate(void) +{ + printk(KERN_INFO "--- Storage manager simulate ---\n"); + toi_prepare_usm(); + schedule(); + printk(KERN_INFO "--- Activate storage 1 ---\n"); + toi_activate_storage(1); + schedule(); + printk(KERN_INFO "--- Deactivate storage 1 ---\n"); + toi_deactivate_storage(1); + schedule(); + printk(KERN_INFO "--- Cleanup usm ---\n"); + toi_cleanup_usm(); + schedule(); + printk(KERN_INFO "--- Storage manager simulate ends ---\n"); +} + +static int usm_storage_needed(void) +{ + return sizeof(int) + strlen(usm_helper_data.program) + 1; +} + +static int usm_save_config_info(char *buf) +{ + int len = strlen(usm_helper_data.program); + memcpy(buf, usm_helper_data.program, len + 1); + return sizeof(int) + len + 1; +} + +static void usm_load_config_info(char *buf, int size) +{ + /* Don't load the saved path if one has already been set */ + if (usm_helper_data.program[0]) + return; + + memcpy(usm_helper_data.program, buf + sizeof(int), *((int *) buf)); +} + +static int usm_memory_needed(void) +{ + /* ball park figure of 32 pages */ + return 32 * PAGE_SIZE; +} + +/* toi_prepare_usm + */ +int toi_prepare_usm(void) +{ + usm_prepare_count++; + + if (usm_prepare_count > 1 || !usm_ops.enabled) + return 0; + + usm_helper_data.pid = -1; + + if (!*usm_helper_data.program) + return 0; + + toi_netlink_setup(&usm_helper_data); + + if (usm_helper_data.pid == -1) + printk(KERN_INFO "TuxOnIce Storage Manager wanted, but couldn't" + " start it.\n"); + + toi_activate_storage(0); + + return usm_helper_data.pid != -1; +} + +void toi_cleanup_usm(void) +{ + usm_prepare_count--; + + if (usm_helper_data.pid > -1 && !usm_prepare_count) { + toi_deactivate_storage(0); + toi_netlink_close(&usm_helper_data); + } +} + +static void storage_manager_activate(void) +{ + if (storage_manager_action == storage_manager_last_action) + return; + + if (storage_manager_action) + toi_prepare_usm(); + else + toi_cleanup_usm(); + + storage_manager_last_action = storage_manager_action; +} + +/* + * User interface specific /sys/power/tuxonice entries. + */ + +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_NONE("simulate_atomic_copy", storage_manager_simulate), + SYSFS_INT("enabled", SYSFS_RW, &usm_ops.enabled, 0, 1, 0, NULL), + SYSFS_STRING("program", SYSFS_RW, usm_helper_data.program, 254, 0, + NULL), + SYSFS_INT("activate_storage", SYSFS_RW , &storage_manager_action, 0, 1, + 0, storage_manager_activate) +}; + +static struct toi_module_ops usm_ops = { + .type = MISC_MODULE, + .name = "usm", + .directory = "storage_manager", + .module = THIS_MODULE, + .storage_needed = usm_storage_needed, + .save_config_info = usm_save_config_info, + .load_config_info = usm_load_config_info, + .memory_needed = usm_memory_needed, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +/* toi_usm_sysfs_init + * Description: Boot time initialisation for user interface. + */ +int toi_usm_init(void) +{ + usm_helper_data.nl = NULL; + usm_helper_data.program[0] = '\0'; + usm_helper_data.pid = -1; + usm_helper_data.skb_size = 0; + usm_helper_data.pool_limit = 6; + usm_helper_data.netlink_id = NETLINK_TOI_USM; + usm_helper_data.name = "userspace storage manager"; + usm_helper_data.rcv_msg = usm_user_rcv_msg; + usm_helper_data.interface_version = 2; + usm_helper_data.must_init = 0; + init_completion(&usm_helper_data.wait_for_process); + + return toi_register_module(&usm_ops); +} + +void toi_usm_exit(void) +{ + toi_netlink_close_complete(&usm_helper_data); + toi_unregister_module(&usm_ops); +} diff --git b/kernel/power/tuxonice_storage.h b/kernel/power/tuxonice_storage.h new file mode 100644 index 0000000..0189c88 --- /dev/null +++ b/kernel/power/tuxonice_storage.h @@ -0,0 +1,45 @@ +/* + * kernel/power/tuxonice_storage.h + * + * Copyright (C) 2005-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + */ + +#ifdef CONFIG_NET +int toi_prepare_usm(void); +void toi_cleanup_usm(void); + +int toi_activate_storage(int force); +int toi_deactivate_storage(int force); +extern int toi_usm_init(void); +extern void toi_usm_exit(void); +#else +static inline int toi_usm_init(void) { return 0; } +static inline void toi_usm_exit(void) { } + +static inline int toi_activate_storage(int force) +{ + return 0; +} + +static inline int toi_deactivate_storage(int force) +{ + return 0; +} + +static inline int toi_prepare_usm(void) { return 0; } +static inline void toi_cleanup_usm(void) { } +#endif + +enum { + USM_MSG_BASE = 0x10, + + /* Kernel -> Userspace */ + USM_MSG_CONNECT = 0x30, + USM_MSG_DISCONNECT = 0x31, + USM_MSG_SUCCESS = 0x40, + USM_MSG_FAILED = 0x41, + + USM_MSG_MAX, +}; diff --git b/kernel/power/tuxonice_swap.c b/kernel/power/tuxonice_swap.c new file mode 100644 index 0000000..9f555c9 --- /dev/null +++ b/kernel/power/tuxonice_swap.c @@ -0,0 +1,474 @@ +/* + * kernel/power/tuxonice_swap.c + * + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * Distributed under GPLv2. + * + * This file encapsulates functions for usage of swap space as a + * backing store. + */ + +#include +#include +#include +#include +#include +#include + +#include "tuxonice.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_modules.h" +#include "tuxonice_io.h" +#include "tuxonice_ui.h" +#include "tuxonice_extent.h" +#include "tuxonice_bio.h" +#include "tuxonice_alloc.h" +#include "tuxonice_builtin.h" + +static struct toi_module_ops toi_swapops; + +/* For swapfile automatically swapon/off'd. */ +static char swapfilename[255] = ""; +static int toi_swapon_status; + +/* Swap Pages */ +static unsigned long swap_allocated; + +static struct sysinfo swapinfo; + +static int is_ram_backed(struct swap_info_struct *si) +{ + if (!strncmp(si->bdev->bd_disk->disk_name, "ram", 3) || + !strncmp(si->bdev->bd_disk->disk_name, "zram", 4)) + return 1; + + return 0; +} + +/** + * enable_swapfile: Swapon the user specified swapfile prior to hibernating. + * + * Activate the given swapfile if it wasn't already enabled. Remember whether + * we really did swapon it for swapoffing later. + */ +static void enable_swapfile(void) +{ + int activateswapresult = -EINVAL; + + if (swapfilename[0]) { + /* Attempt to swap on with maximum priority */ + activateswapresult = sys_swapon(swapfilename, 0xFFFF); + if (activateswapresult && activateswapresult != -EBUSY) + printk(KERN_ERR "TuxOnIce: The swapfile/partition " + "specified by /sys/power/tuxonice/swap/swapfile" + " (%s) could not be turned on (error %d). " + "Attempting to continue.\n", + swapfilename, activateswapresult); + if (!activateswapresult) + toi_swapon_status = 1; + } +} + +/** + * disable_swapfile: Swapoff any file swaponed at the start of the cycle. + * + * If we did successfully swapon a file at the start of the cycle, swapoff + * it now (finishing up). + */ +static void disable_swapfile(void) +{ + if (!toi_swapon_status) + return; + + sys_swapoff(swapfilename); + toi_swapon_status = 0; +} + +static int add_blocks_to_extent_chain(struct toi_bdev_info *chain, + unsigned long start, unsigned long end) +{ + if (test_action_state(TOI_TEST_BIO)) + toi_message(TOI_IO, TOI_VERBOSE, 0, "Adding extent %lu-%lu to " + "chain %p.", start << chain->bmap_shift, + end << chain->bmap_shift, chain); + + return toi_add_to_extent_chain(&chain->blocks, start, end); +} + + +static int get_main_pool_phys_params(struct toi_bdev_info *chain) +{ + struct hibernate_extent *extentpointer = NULL; + unsigned long address, extent_min = 0, extent_max = 0; + int empty = 1; + + toi_message(TOI_IO, TOI_VERBOSE, 0, "get main pool phys params for " + "chain %d.", chain->allocator_index); + + if (!chain->allocations.first) + return 0; + + if (chain->blocks.first) + toi_put_extent_chain(&chain->blocks); + + toi_extent_for_each(&chain->allocations, extentpointer, address) { + swp_entry_t swap_address = (swp_entry_t) { address }; + struct block_device *bdev; + sector_t new_sector = map_swap_entry(swap_address, &bdev); + + if (empty) { + empty = 0; + extent_min = extent_max = new_sector; + continue; + } + + if (new_sector == extent_max + 1) { + extent_max++; + continue; + } + + if (add_blocks_to_extent_chain(chain, extent_min, extent_max)) { + printk(KERN_ERR "Out of memory while making block " + "chains.\n"); + return -ENOMEM; + } + + extent_min = new_sector; + extent_max = new_sector; + } + + if (!empty && + add_blocks_to_extent_chain(chain, extent_min, extent_max)) { + printk(KERN_ERR "Out of memory while making block chains.\n"); + return -ENOMEM; + } + + return 0; +} + +/* + * Like si_swapinfo, except that we don't include ram backed swap (compcache!) + * and don't need to use the spinlocks (userspace is stopped when this + * function is called). + */ +void si_swapinfo_no_compcache(void) +{ + unsigned int i; + + si_swapinfo(&swapinfo); + swapinfo.freeswap = 0; + swapinfo.totalswap = 0; + + for (i = 0; i < MAX_SWAPFILES; i++) { + struct swap_info_struct *si = get_swap_info_struct(i); + if (si && (si->flags & SWP_WRITEOK) && !is_ram_backed(si)) { + swapinfo.totalswap += si->inuse_pages; + swapinfo.freeswap += si->pages - si->inuse_pages; + } + } +} +/* + * We can't just remember the value from allocation time, because other + * processes might have allocated swap in the mean time. + */ +static unsigned long toi_swap_storage_available(void) +{ + toi_message(TOI_IO, TOI_VERBOSE, 0, "In toi_swap_storage_available."); + si_swapinfo_no_compcache(); + return swapinfo.freeswap + swap_allocated; +} + +static int toi_swap_initialise(int starting_cycle) +{ + if (!starting_cycle) + return 0; + + enable_swapfile(); + return 0; +} + +static void toi_swap_cleanup(int ending_cycle) +{ + if (!ending_cycle) + return; + + disable_swapfile(); +} + +static void toi_swap_free_storage(struct toi_bdev_info *chain) +{ + /* Free swap entries */ + struct hibernate_extent *extentpointer; + unsigned long extentvalue; + + toi_message(TOI_IO, TOI_VERBOSE, 0, "Freeing storage for chain %p.", + chain); + + swap_allocated -= chain->allocations.size; + toi_extent_for_each(&chain->allocations, extentpointer, extentvalue) + swap_free((swp_entry_t) { extentvalue }); + + toi_put_extent_chain(&chain->allocations); +} + +static void free_swap_range(unsigned long min, unsigned long max) +{ + int j; + + for (j = min; j <= max; j++) + swap_free((swp_entry_t) { j }); + swap_allocated -= (max - min + 1); +} + +/* + * Allocation of a single swap type. Swap priorities are handled at the higher + * level. + */ +static int toi_swap_allocate_storage(struct toi_bdev_info *chain, + unsigned long request) +{ + unsigned long gotten = 0; + + toi_message(TOI_IO, TOI_VERBOSE, 0, " Swap allocate storage: Asked to" + " allocate %lu pages from device %d.", request, + chain->allocator_index); + + while (gotten < request) { + swp_entry_t start, end; + if (0) { + /* Broken at the moment for SSDs */ + get_swap_range_of_type(chain->allocator_index, &start, &end, + request - gotten + 1); + } else { + start = end = get_swap_page_of_type(chain->allocator_index); + } + if (start.val) { + int added = end.val - start.val + 1; + if (toi_add_to_extent_chain(&chain->allocations, + start.val, end.val)) { + printk(KERN_INFO "Failed to allocate extent for " + "%lu-%lu.\n", start.val, end.val); + free_swap_range(start.val, end.val); + break; + } + gotten += added; + swap_allocated += added; + } else + break; + } + + toi_message(TOI_IO, TOI_VERBOSE, 0, " Allocated %lu pages.", gotten); + return gotten; +} + +static int toi_swap_register_storage(void) +{ + int i, result = 0; + + toi_message(TOI_IO, TOI_VERBOSE, 0, "toi_swap_register_storage."); + for (i = 0; i < MAX_SWAPFILES; i++) { + struct swap_info_struct *si = get_swap_info_struct(i); + struct toi_bdev_info *devinfo; + unsigned char *p; + unsigned char buf[256]; + struct fs_info *fs_info; + + if (!si || !(si->flags & SWP_WRITEOK) || is_ram_backed(si)) + continue; + + devinfo = toi_kzalloc(39, sizeof(struct toi_bdev_info), + GFP_ATOMIC); + if (!devinfo) { + printk("Failed to allocate devinfo struct for swap " + "device %d.\n", i); + return -ENOMEM; + } + + devinfo->bdev = si->bdev; + devinfo->allocator = &toi_swapops; + devinfo->allocator_index = i; + + fs_info = fs_info_from_block_dev(si->bdev); + if (fs_info && !IS_ERR(fs_info)) { + memcpy(devinfo->uuid, &fs_info->uuid, 16); + free_fs_info(fs_info); + } else + result = (int) PTR_ERR(fs_info); + + if (!fs_info) + printk("fs_info from block dev returned %d.\n", result); + devinfo->dev_t = si->bdev->bd_dev; + devinfo->prio = si->prio; + devinfo->bmap_shift = 3; + devinfo->blocks_per_page = 1; + + p = d_path(&si->swap_file->f_path, buf, sizeof(buf)); + sprintf(devinfo->name, "swap on %s", p); + + toi_message(TOI_IO, TOI_VERBOSE, 0, "Registering swap storage:" + " Device %d (%lx), prio %d.", i, + (unsigned long) devinfo->dev_t, devinfo->prio); + toi_bio_ops.register_storage(devinfo); + } + + return 0; +} + +static unsigned long toi_swap_free_unused_storage(struct toi_bdev_info *chain, unsigned long used) +{ + struct hibernate_extent *extentpointer = NULL; + unsigned long extentvalue; + unsigned long i = 0, first_freed = 0; + + toi_extent_for_each(&chain->allocations, extentpointer, extentvalue) { + i++; + if (i > used) { + swap_free((swp_entry_t) { extentvalue }); + if (!first_freed) + first_freed = extentvalue; + } + } + + return first_freed; +} + +/* + * workspace_size + * + * Description: + * Returns the number of bytes of RAM needed for this + * code to do its work. (Used when calculating whether + * we have enough memory to be able to hibernate & resume). + * + */ +static int toi_swap_memory_needed(void) +{ + return 1; +} + +/* + * Print debug info + * + * Description: + */ +static int toi_swap_print_debug_stats(char *buffer, int size) +{ + int len = 0; + + len = scnprintf(buffer, size, "- Swap Allocator enabled.\n"); + if (swapfilename[0]) + len += scnprintf(buffer+len, size-len, + " Attempting to automatically swapon: %s.\n", + swapfilename); + + si_swapinfo_no_compcache(); + + len += scnprintf(buffer+len, size-len, + " Swap available for image: %lu pages.\n", + swapinfo.freeswap + swap_allocated); + + return len; +} + +static int header_locations_read_sysfs(const char *page, int count) +{ + int i, printedpartitionsmessage = 0, len = 0, haveswap = 0; + struct inode *swapf = NULL; + int zone; + char *path_page = (char *) toi_get_free_page(10, GFP_KERNEL); + char *path, *output = (char *) page; + int path_len; + + if (!page) + return 0; + + for (i = 0; i < MAX_SWAPFILES; i++) { + struct swap_info_struct *si = get_swap_info_struct(i); + + if (!si || !(si->flags & SWP_WRITEOK)) + continue; + + if (S_ISBLK(si->swap_file->f_mapping->host->i_mode)) { + haveswap = 1; + if (!printedpartitionsmessage) { + len += sprintf(output + len, + "For swap partitions, simply use the " + "format: resume=swap:/dev/hda1.\n"); + printedpartitionsmessage = 1; + } + } else { + path_len = 0; + + path = d_path(&si->swap_file->f_path, path_page, + PAGE_SIZE); + path_len = snprintf(path_page, PAGE_SIZE, "%s", path); + + haveswap = 1; + swapf = si->swap_file->f_mapping->host; + zone = bmap(swapf, 0); + if (!zone) { + len += sprintf(output + len, + "Swapfile %s has been corrupted. Reuse" + " mkswap on it and try again.\n", + path_page); + } else { + char name_buffer[BDEVNAME_SIZE]; + len += sprintf(output + len, + "For swapfile `%s`," + " use resume=swap:/dev/%s:0x%x.\n", + path_page, + bdevname(si->bdev, name_buffer), + zone << (swapf->i_blkbits - 9)); + } + } + } + + if (!haveswap) + len = sprintf(output, "You need to turn on swap partitions " + "before examining this file.\n"); + + toi_free_page(10, (unsigned long) path_page); + return len; +} + +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_STRING("swapfilename", SYSFS_RW, swapfilename, 255, 0, NULL), + SYSFS_CUSTOM("headerlocations", SYSFS_READONLY, + header_locations_read_sysfs, NULL, 0, NULL), + SYSFS_INT("enabled", SYSFS_RW, &toi_swapops.enabled, 0, 1, 0, + attempt_to_parse_resume_device2), +}; + +static struct toi_bio_allocator_ops toi_bio_swapops = { + .register_storage = toi_swap_register_storage, + .storage_available = toi_swap_storage_available, + .allocate_storage = toi_swap_allocate_storage, + .bmap = get_main_pool_phys_params, + .free_storage = toi_swap_free_storage, + .free_unused_storage = toi_swap_free_unused_storage, +}; + +static struct toi_module_ops toi_swapops = { + .type = BIO_ALLOCATOR_MODULE, + .name = "swap storage", + .directory = "swap", + .module = THIS_MODULE, + .memory_needed = toi_swap_memory_needed, + .print_debug_info = toi_swap_print_debug_stats, + .initialise = toi_swap_initialise, + .cleanup = toi_swap_cleanup, + .bio_allocator_ops = &toi_bio_swapops, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +/* ---- Registration ---- */ +static __init int toi_swap_load(void) +{ + return toi_register_module(&toi_swapops); +} + +late_initcall(toi_swap_load); diff --git b/kernel/power/tuxonice_sysfs.c b/kernel/power/tuxonice_sysfs.c new file mode 100644 index 0000000..77f36db --- /dev/null +++ b/kernel/power/tuxonice_sysfs.c @@ -0,0 +1,333 @@ +/* + * kernel/power/tuxonice_sysfs.c + * + * Copyright (C) 2002-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * This file contains support for sysfs entries for tuning TuxOnIce. + * + * We have a generic handler that deals with the most common cases, and + * hooks for special handlers to use. + */ + +#include + +#include "tuxonice_sysfs.h" +#include "tuxonice.h" +#include "tuxonice_storage.h" +#include "tuxonice_alloc.h" + +static int toi_sysfs_initialised; + +static void toi_initialise_sysfs(void); + +static struct toi_sysfs_data sysfs_params[]; + +#define to_sysfs_data(_attr) container_of(_attr, struct toi_sysfs_data, attr) + +static void toi_main_wrapper(void) +{ + toi_try_hibernate(); +} + +static ssize_t toi_attr_show(struct kobject *kobj, struct attribute *attr, + char *page) +{ + struct toi_sysfs_data *sysfs_data = to_sysfs_data(attr); + int len = 0; + int full_prep = sysfs_data->flags & SYSFS_NEEDS_SM_FOR_READ; + + if (full_prep && toi_start_anything(0)) + return -EBUSY; + + if (sysfs_data->flags & SYSFS_NEEDS_SM_FOR_READ) + toi_prepare_usm(); + + switch (sysfs_data->type) { + case TOI_SYSFS_DATA_CUSTOM: + len = (sysfs_data->data.special.read_sysfs) ? + (sysfs_data->data.special.read_sysfs)(page, PAGE_SIZE) + : 0; + break; + case TOI_SYSFS_DATA_BIT: + len = sprintf(page, "%d\n", + -test_bit(sysfs_data->data.bit.bit, + sysfs_data->data.bit.bit_vector)); + break; + case TOI_SYSFS_DATA_INTEGER: + len = sprintf(page, "%d\n", + *(sysfs_data->data.integer.variable)); + break; + case TOI_SYSFS_DATA_LONG: + len = sprintf(page, "%ld\n", + *(sysfs_data->data.a_long.variable)); + break; + case TOI_SYSFS_DATA_UL: + len = sprintf(page, "%lu\n", + *(sysfs_data->data.ul.variable)); + break; + case TOI_SYSFS_DATA_STRING: + len = sprintf(page, "%s\n", + sysfs_data->data.string.variable); + break; + } + + if (sysfs_data->flags & SYSFS_NEEDS_SM_FOR_READ) + toi_cleanup_usm(); + + if (full_prep) + toi_finish_anything(0); + + return len; +} + +#define BOUND(_variable, _type) do { \ + if (*_variable < sysfs_data->data._type.minimum) \ + *_variable = sysfs_data->data._type.minimum; \ + else if (*_variable > sysfs_data->data._type.maximum) \ + *_variable = sysfs_data->data._type.maximum; \ +} while (0) + +static ssize_t toi_attr_store(struct kobject *kobj, struct attribute *attr, + const char *my_buf, size_t count) +{ + int assigned_temp_buffer = 0, result = count; + struct toi_sysfs_data *sysfs_data = to_sysfs_data(attr); + + if (toi_start_anything((sysfs_data->flags & SYSFS_HIBERNATE_OR_RESUME))) + return -EBUSY; + + ((char *) my_buf)[count] = 0; + + if (sysfs_data->flags & SYSFS_NEEDS_SM_FOR_WRITE) + toi_prepare_usm(); + + switch (sysfs_data->type) { + case TOI_SYSFS_DATA_CUSTOM: + if (sysfs_data->data.special.write_sysfs) + result = (sysfs_data->data.special.write_sysfs)(my_buf, + count); + break; + case TOI_SYSFS_DATA_BIT: + { + unsigned long value; + result = kstrtoul(my_buf, 0, &value); + if (result) + break; + if (value) + set_bit(sysfs_data->data.bit.bit, + (sysfs_data->data.bit.bit_vector)); + else + clear_bit(sysfs_data->data.bit.bit, + (sysfs_data->data.bit.bit_vector)); + } + break; + case TOI_SYSFS_DATA_INTEGER: + { + long temp; + result = kstrtol(my_buf, 0, &temp); + if (result) + break; + *(sysfs_data->data.integer.variable) = (int) temp; + BOUND(sysfs_data->data.integer.variable, integer); + break; + } + case TOI_SYSFS_DATA_LONG: + { + long *variable = + sysfs_data->data.a_long.variable; + result = kstrtol(my_buf, 0, variable); + if (result) + break; + BOUND(variable, a_long); + break; + } + case TOI_SYSFS_DATA_UL: + { + unsigned long *variable = + sysfs_data->data.ul.variable; + result = kstrtoul(my_buf, 0, variable); + if (result) + break; + BOUND(variable, ul); + break; + } + break; + case TOI_SYSFS_DATA_STRING: + { + int copy_len = count; + char *variable = + sysfs_data->data.string.variable; + + if (sysfs_data->data.string.max_length && + (copy_len > sysfs_data->data.string.max_length)) + copy_len = sysfs_data->data.string.max_length; + + if (!variable) { + variable = (char *) toi_get_zeroed_page(31, + TOI_ATOMIC_GFP); + sysfs_data->data.string.variable = variable; + assigned_temp_buffer = 1; + } + strncpy(variable, my_buf, copy_len); + if (copy_len && my_buf[copy_len - 1] == '\n') + variable[count - 1] = 0; + variable[count] = 0; + } + break; + } + + if (!result) + result = count; + + /* Side effect routine? */ + if (result == count && sysfs_data->write_side_effect) + sysfs_data->write_side_effect(); + + /* Free temporary buffers */ + if (assigned_temp_buffer) { + toi_free_page(31, + (unsigned long) sysfs_data->data.string.variable); + sysfs_data->data.string.variable = NULL; + } + + if (sysfs_data->flags & SYSFS_NEEDS_SM_FOR_WRITE) + toi_cleanup_usm(); + + toi_finish_anything(sysfs_data->flags & SYSFS_HIBERNATE_OR_RESUME); + + return result; +} + +static struct sysfs_ops toi_sysfs_ops = { + .show = &toi_attr_show, + .store = &toi_attr_store, +}; + +static struct kobj_type toi_ktype = { + .sysfs_ops = &toi_sysfs_ops, +}; + +struct kobject *tuxonice_kobj; + +/* Non-module sysfs entries. + * + * This array contains entries that are automatically registered at + * boot. Modules and the console code register their own entries separately. + */ + +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_CUSTOM("do_hibernate", SYSFS_WRITEONLY, NULL, NULL, + SYSFS_HIBERNATING, toi_main_wrapper), + SYSFS_CUSTOM("do_resume", SYSFS_WRITEONLY, NULL, NULL, + SYSFS_RESUMING, toi_try_resume) +}; + +void remove_toi_sysdir(struct kobject *kobj) +{ + if (!kobj) + return; + + kobject_put(kobj); +} + +struct kobject *make_toi_sysdir(char *name) +{ + struct kobject *kobj = kobject_create_and_add(name, tuxonice_kobj); + + if (!kobj) { + printk(KERN_INFO "TuxOnIce: Can't allocate kobject for sysfs " + "dir!\n"); + return NULL; + } + + kobj->ktype = &toi_ktype; + + return kobj; +} + +/* toi_register_sysfs_file + * + * Helper for registering a new /sysfs/tuxonice entry. + */ + +int toi_register_sysfs_file( + struct kobject *kobj, + struct toi_sysfs_data *toi_sysfs_data) +{ + int result; + + if (!toi_sysfs_initialised) + toi_initialise_sysfs(); + + result = sysfs_create_file(kobj, &toi_sysfs_data->attr); + if (result) + printk(KERN_INFO "TuxOnIce: sysfs_create_file for %s " + "returned %d.\n", + toi_sysfs_data->attr.name, result); + kobj->ktype = &toi_ktype; + + return result; +} + +/* toi_unregister_sysfs_file + * + * Helper for removing unwanted /sys/power/tuxonice entries. + * + */ +void toi_unregister_sysfs_file(struct kobject *kobj, + struct toi_sysfs_data *toi_sysfs_data) +{ + sysfs_remove_file(kobj, &toi_sysfs_data->attr); +} + +void toi_cleanup_sysfs(void) +{ + int i, + numfiles = sizeof(sysfs_params) / sizeof(struct toi_sysfs_data); + + if (!toi_sysfs_initialised) + return; + + for (i = 0; i < numfiles; i++) + toi_unregister_sysfs_file(tuxonice_kobj, &sysfs_params[i]); + + kobject_put(tuxonice_kobj); + toi_sysfs_initialised = 0; +} + +/* toi_initialise_sysfs + * + * Initialise the /sysfs/tuxonice directory. + */ + +static void toi_initialise_sysfs(void) +{ + int i; + int numfiles = sizeof(sysfs_params) / sizeof(struct toi_sysfs_data); + + if (toi_sysfs_initialised) + return; + + /* Make our TuxOnIce directory a child of /sys/power */ + tuxonice_kobj = kobject_create_and_add("tuxonice", power_kobj); + if (!tuxonice_kobj) + return; + + toi_sysfs_initialised = 1; + + for (i = 0; i < numfiles; i++) + toi_register_sysfs_file(tuxonice_kobj, &sysfs_params[i]); +} + +int toi_sysfs_init(void) +{ + toi_initialise_sysfs(); + return 0; +} + +void toi_sysfs_exit(void) +{ + toi_cleanup_sysfs(); +} diff --git b/kernel/power/tuxonice_sysfs.h b/kernel/power/tuxonice_sysfs.h new file mode 100644 index 0000000..1de954c --- /dev/null +++ b/kernel/power/tuxonice_sysfs.h @@ -0,0 +1,137 @@ +/* + * kernel/power/tuxonice_sysfs.h + * + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + */ + +#include + +struct toi_sysfs_data { + struct attribute attr; + int type; + int flags; + union { + struct { + unsigned long *bit_vector; + int bit; + } bit; + struct { + int *variable; + int minimum; + int maximum; + } integer; + struct { + long *variable; + long minimum; + long maximum; + } a_long; + struct { + unsigned long *variable; + unsigned long minimum; + unsigned long maximum; + } ul; + struct { + char *variable; + int max_length; + } string; + struct { + int (*read_sysfs) (const char *buffer, int count); + int (*write_sysfs) (const char *buffer, int count); + void *data; + } special; + } data; + + /* Side effects routine. Used, eg, for reparsing the + * resume= entry when it changes */ + void (*write_side_effect) (void); + struct list_head sysfs_data_list; +}; + +enum { + TOI_SYSFS_DATA_NONE = 1, + TOI_SYSFS_DATA_CUSTOM, + TOI_SYSFS_DATA_BIT, + TOI_SYSFS_DATA_INTEGER, + TOI_SYSFS_DATA_UL, + TOI_SYSFS_DATA_LONG, + TOI_SYSFS_DATA_STRING +}; + +#define SYSFS_WRITEONLY 0200 +#define SYSFS_READONLY 0444 +#define SYSFS_RW 0644 + +#define SYSFS_BIT(_name, _mode, _ul, _bit, _flags) { \ + .attr = {.name = _name , .mode = _mode }, \ + .type = TOI_SYSFS_DATA_BIT, \ + .flags = _flags, \ + .data = { .bit = { .bit_vector = _ul, .bit = _bit } } } + +#define SYSFS_INT(_name, _mode, _int, _min, _max, _flags, _wse) { \ + .attr = {.name = _name , .mode = _mode }, \ + .type = TOI_SYSFS_DATA_INTEGER, \ + .flags = _flags, \ + .data = { .integer = { .variable = _int, .minimum = _min, \ + .maximum = _max } }, \ + .write_side_effect = _wse } + +#define SYSFS_UL(_name, _mode, _ul, _min, _max, _flags) { \ + .attr = {.name = _name , .mode = _mode }, \ + .type = TOI_SYSFS_DATA_UL, \ + .flags = _flags, \ + .data = { .ul = { .variable = _ul, .minimum = _min, \ + .maximum = _max } } } + +#define SYSFS_LONG(_name, _mode, _long, _min, _max, _flags) { \ + .attr = {.name = _name , .mode = _mode }, \ + .type = TOI_SYSFS_DATA_LONG, \ + .flags = _flags, \ + .data = { .a_long = { .variable = _long, .minimum = _min, \ + .maximum = _max } } } + +#define SYSFS_STRING(_name, _mode, _string, _max_len, _flags, _wse) { \ + .attr = {.name = _name , .mode = _mode }, \ + .type = TOI_SYSFS_DATA_STRING, \ + .flags = _flags, \ + .data = { .string = { .variable = _string, .max_length = _max_len } }, \ + .write_side_effect = _wse } + +#define SYSFS_CUSTOM(_name, _mode, _read, _write, _flags, _wse) { \ + .attr = {.name = _name , .mode = _mode }, \ + .type = TOI_SYSFS_DATA_CUSTOM, \ + .flags = _flags, \ + .data = { .special = { .read_sysfs = _read, .write_sysfs = _write } }, \ + .write_side_effect = _wse } + +#define SYSFS_NONE(_name, _wse) { \ + .attr = {.name = _name , .mode = SYSFS_WRITEONLY }, \ + .type = TOI_SYSFS_DATA_NONE, \ + .write_side_effect = _wse, \ +} + +/* Flags */ +#define SYSFS_NEEDS_SM_FOR_READ 1 +#define SYSFS_NEEDS_SM_FOR_WRITE 2 +#define SYSFS_HIBERNATE 4 +#define SYSFS_RESUME 8 +#define SYSFS_HIBERNATE_OR_RESUME (SYSFS_HIBERNATE | SYSFS_RESUME) +#define SYSFS_HIBERNATING (SYSFS_HIBERNATE | SYSFS_NEEDS_SM_FOR_WRITE) +#define SYSFS_RESUMING (SYSFS_RESUME | SYSFS_NEEDS_SM_FOR_WRITE) +#define SYSFS_NEEDS_SM_FOR_BOTH \ + (SYSFS_NEEDS_SM_FOR_READ | SYSFS_NEEDS_SM_FOR_WRITE) + +int toi_register_sysfs_file(struct kobject *kobj, + struct toi_sysfs_data *toi_sysfs_data); +void toi_unregister_sysfs_file(struct kobject *kobj, + struct toi_sysfs_data *toi_sysfs_data); + +extern struct kobject *tuxonice_kobj; + +struct kobject *make_toi_sysdir(char *name); +void remove_toi_sysdir(struct kobject *obj); +extern void toi_cleanup_sysfs(void); + +extern int toi_sysfs_init(void); +extern void toi_sysfs_exit(void); diff --git b/kernel/power/tuxonice_ui.c b/kernel/power/tuxonice_ui.c new file mode 100644 index 0000000..76152f3 --- /dev/null +++ b/kernel/power/tuxonice_ui.c @@ -0,0 +1,247 @@ +/* + * kernel/power/tuxonice_ui.c + * + * Copyright (C) 1998-2001 Gabor Kuti + * Copyright (C) 1998,2001,2002 Pavel Machek + * Copyright (C) 2002-2003 Florent Chabaud + * Copyright (C) 2002-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * Routines for TuxOnIce's user interface. + * + * The user interface code talks to a userspace program via a + * netlink socket. + * + * The kernel side: + * - starts the userui program; + * - sends text messages and progress bar status; + * + * The user space side: + * - passes messages regarding user requests (abort, toggle reboot etc) + * + */ + +#define __KERNEL_SYSCALLS__ + +#include + +#include "tuxonice_sysfs.h" +#include "tuxonice_modules.h" +#include "tuxonice.h" +#include "tuxonice_ui.h" +#include "tuxonice_netlink.h" +#include "tuxonice_power_off.h" +#include "tuxonice_builtin.h" + +static char local_printf_buf[1024]; /* Same as printk - should be safe */ +struct ui_ops *toi_current_ui; + +/** + * toi_wait_for_keypress - Wait for keypress via userui or /dev/console. + * + * @timeout: Maximum time to wait. + * + * Wait for a keypress, either from userui or /dev/console if userui isn't + * available. The non-userui path is particularly for at boot-time, prior + * to userui being started, when we have an important warning to give to + * the user. + */ +static char toi_wait_for_keypress(int timeout) +{ + if (toi_current_ui && toi_current_ui->wait_for_key(timeout)) + return ' '; + + return toi_wait_for_keypress_dev_console(timeout); +} + +/* toi_early_boot_message() + * Description: Handle errors early in the process of booting. + * The user may press C to continue booting, perhaps + * invalidating the image, or space to reboot. + * This works from either the serial console or normally + * attached keyboard. + * + * Note that we come in here from init, while the kernel is + * locked. If we want to get events from the serial console, + * we need to temporarily unlock the kernel. + * + * toi_early_boot_message may also be called post-boot. + * In this case, it simply printks the message and returns. + * + * Arguments: int Whether we are able to erase the image. + * int default_answer. What to do when we timeout. This + * will normally be continue, but the user might + * provide command line options (__setup) to override + * particular cases. + * Char *. Pointer to a string explaining why we're moaning. + */ + +#define say(message, a...) printk(KERN_EMERG message, ##a) + +void toi_early_boot_message(int message_detail, int default_answer, + char *warning_reason, ...) +{ +#if defined(CONFIG_VT) || defined(CONFIG_SERIAL_CONSOLE) + unsigned long orig_state = get_toi_state(), continue_req = 0; + unsigned long orig_loglevel = console_loglevel; + int can_ask = 1; +#else + int can_ask = 0; +#endif + + va_list args; + int printed_len; + + if (!toi_wait) { + set_toi_state(TOI_CONTINUE_REQ); + can_ask = 0; + } + + if (warning_reason) { + va_start(args, warning_reason); + printed_len = vsnprintf(local_printf_buf, + sizeof(local_printf_buf), + warning_reason, + args); + va_end(args); + } + + if (!test_toi_state(TOI_BOOT_TIME)) { + printk("TuxOnIce: %s\n", local_printf_buf); + return; + } + + if (!can_ask) { + continue_req = !!default_answer; + goto post_ask; + } + +#if defined(CONFIG_VT) || defined(CONFIG_SERIAL_CONSOLE) + console_loglevel = 7; + + say("=== TuxOnIce ===\n\n"); + if (warning_reason) { + say("BIG FAT WARNING!! %s\n\n", local_printf_buf); + switch (message_detail) { + case 0: + say("If you continue booting, note that any image WILL" + "NOT BE REMOVED.\nTuxOnIce is unable to do so " + "because the appropriate modules aren't\n" + "loaded. You should manually remove the image " + "to avoid any\npossibility of corrupting your " + "filesystem(s) later.\n"); + break; + case 1: + say("If you want to use the current TuxOnIce image, " + "reboot and try\nagain with the same kernel " + "that you hibernated from. If you want\n" + "to forget that image, continue and the image " + "will be erased.\n"); + break; + } + say("Press SPACE to reboot or C to continue booting with " + "this kernel\n\n"); + if (toi_wait > 0) + say("Default action if you don't select one in %d " + "seconds is: %s.\n", + toi_wait, + default_answer == TOI_CONTINUE_REQ ? + "continue booting" : "reboot"); + } else { + say("BIG FAT WARNING!!\n\n" + "You have tried to resume from this image before.\n" + "If it failed once, it may well fail again.\n" + "Would you like to remove the image and boot " + "normally?\nThis will be equivalent to entering " + "noresume on the\nkernel command line.\n\n" + "Press SPACE to remove the image or C to continue " + "resuming.\n\n"); + if (toi_wait > 0) + say("Default action if you don't select one in %d " + "seconds is: %s.\n", toi_wait, + !!default_answer ? + "continue resuming" : "remove the image"); + } + console_loglevel = orig_loglevel; + + set_toi_state(TOI_SANITY_CHECK_PROMPT); + clear_toi_state(TOI_CONTINUE_REQ); + + if (toi_wait_for_keypress(toi_wait) == 0) /* We timed out */ + continue_req = !!default_answer; + else + continue_req = test_toi_state(TOI_CONTINUE_REQ); + +#endif /* CONFIG_VT or CONFIG_SERIAL_CONSOLE */ + +post_ask: + if ((warning_reason) && (!continue_req)) + kernel_restart(NULL); + + restore_toi_state(orig_state); + if (continue_req) + set_toi_state(TOI_CONTINUE_REQ); +} + +#undef say + +/* + * User interface specific /sys/power/tuxonice entries. + */ + +static struct toi_sysfs_data sysfs_params[] = { +#if defined(CONFIG_NET) && defined(CONFIG_SYSFS) + SYSFS_INT("default_console_level", SYSFS_RW, + &toi_bkd.toi_default_console_level, 0, 7, 0, NULL), + SYSFS_UL("debug_sections", SYSFS_RW, &toi_bkd.toi_debug_state, 0, + 1 << 30, 0), + SYSFS_BIT("log_everything", SYSFS_RW, &toi_bkd.toi_action, TOI_LOGALL, + 0) +#endif +}; + +static struct toi_module_ops userui_ops = { + .type = MISC_HIDDEN_MODULE, + .name = "printk ui", + .directory = "user_interface", + .module = THIS_MODULE, + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +int toi_register_ui_ops(struct ui_ops *this_ui) +{ + if (toi_current_ui) { + printk(KERN_INFO "Only one TuxOnIce user interface module can " + "be loaded at a time."); + return -EBUSY; + } + + toi_current_ui = this_ui; + + return 0; +} + +void toi_remove_ui_ops(struct ui_ops *this_ui) +{ + if (toi_current_ui != this_ui) + return; + + toi_current_ui = NULL; +} + +/* toi_console_sysfs_init + * Description: Boot time initialisation for user interface. + */ + +int toi_ui_init(void) +{ + return toi_register_module(&userui_ops); +} + +void toi_ui_exit(void) +{ + toi_unregister_module(&userui_ops); +} diff --git b/kernel/power/tuxonice_ui.h b/kernel/power/tuxonice_ui.h new file mode 100644 index 0000000..4934e3a --- /dev/null +++ b/kernel/power/tuxonice_ui.h @@ -0,0 +1,97 @@ +/* + * kernel/power/tuxonice_ui.h + * + * Copyright (C) 2004-2015 Nigel Cunningham (nigel at nigelcunningham com au) + */ + +enum { + DONT_CLEAR_BAR, + CLEAR_BAR +}; + +enum { + /* Userspace -> Kernel */ + USERUI_MSG_ABORT = 0x11, + USERUI_MSG_SET_STATE = 0x12, + USERUI_MSG_GET_STATE = 0x13, + USERUI_MSG_GET_DEBUG_STATE = 0x14, + USERUI_MSG_SET_DEBUG_STATE = 0x15, + USERUI_MSG_SPACE = 0x18, + USERUI_MSG_GET_POWERDOWN_METHOD = 0x1A, + USERUI_MSG_SET_POWERDOWN_METHOD = 0x1B, + USERUI_MSG_GET_LOGLEVEL = 0x1C, + USERUI_MSG_SET_LOGLEVEL = 0x1D, + USERUI_MSG_PRINTK = 0x1E, + + /* Kernel -> Userspace */ + USERUI_MSG_MESSAGE = 0x21, + USERUI_MSG_PROGRESS = 0x22, + USERUI_MSG_POST_ATOMIC_RESTORE = 0x25, + + USERUI_MSG_MAX, +}; + +struct userui_msg_params { + u32 a, b, c, d; + char text[255]; +}; + +struct ui_ops { + char (*wait_for_key) (int timeout); + u32 (*update_status) (u32 value, u32 maximum, const char *fmt, ...); + void (*prepare_status) (int clearbar, const char *fmt, ...); + void (*cond_pause) (int pause, char *message); + void (*abort)(int result_code, const char *fmt, ...); + void (*prepare)(void); + void (*cleanup)(void); + void (*message)(u32 section, u32 level, u32 normally_logged, + const char *fmt, ...); +}; + +extern struct ui_ops *toi_current_ui; + +#define toi_update_status(val, max, fmt, args...) \ + (toi_current_ui ? (toi_current_ui->update_status) (val, max, fmt, ##args) : \ + max) + +#define toi_prepare_console(void) \ + do { if (toi_current_ui) \ + (toi_current_ui->prepare)(); \ + } while (0) + +#define toi_cleanup_console(void) \ + do { if (toi_current_ui) \ + (toi_current_ui->cleanup)(); \ + } while (0) + +#define abort_hibernate(result, fmt, args...) \ + do { if (toi_current_ui) \ + (toi_current_ui->abort)(result, fmt, ##args); \ + else { \ + set_abort_result(result); \ + } \ + } while (0) + +#define toi_cond_pause(pause, message) \ + do { if (toi_current_ui) \ + (toi_current_ui->cond_pause)(pause, message); \ + } while (0) + +#define toi_prepare_status(clear, fmt, args...) \ + do { if (toi_current_ui) \ + (toi_current_ui->prepare_status)(clear, fmt, ##args); \ + else \ + printk(KERN_INFO fmt "%s", ##args, "\n"); \ + } while (0) + +#define toi_message(sn, lev, log, fmt, a...) \ +do { \ + if (toi_current_ui && (!sn || test_debug_state(sn))) \ + toi_current_ui->message(sn, lev, log, fmt, ##a); \ +} while (0) + +__exit void toi_ui_cleanup(void); +extern int toi_ui_init(void); +extern void toi_ui_exit(void); +extern int toi_register_ui_ops(struct ui_ops *this_ui); +extern void toi_remove_ui_ops(struct ui_ops *this_ui); diff --git b/kernel/power/tuxonice_userui.c b/kernel/power/tuxonice_userui.c new file mode 100644 index 0000000..6aa5ac3 --- /dev/null +++ b/kernel/power/tuxonice_userui.c @@ -0,0 +1,658 @@ +/* + * kernel/power/user_ui.c + * + * Copyright (C) 2005-2007 Bernard Blackham + * Copyright (C) 2002-2015 Nigel Cunningham (nigel at nigelcunningham com au) + * + * This file is released under the GPLv2. + * + * Routines for TuxOnIce's user interface. + * + * The user interface code talks to a userspace program via a + * netlink socket. + * + * The kernel side: + * - starts the userui program; + * - sends text messages and progress bar status; + * + * The user space side: + * - passes messages regarding user requests (abort, toggle reboot etc) + * + */ + +#define __KERNEL_SYSCALLS__ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "tuxonice_sysfs.h" +#include "tuxonice_modules.h" +#include "tuxonice.h" +#include "tuxonice_ui.h" +#include "tuxonice_netlink.h" +#include "tuxonice_power_off.h" + +static char local_printf_buf[1024]; /* Same as printk - should be safe */ + +static struct user_helper_data ui_helper_data; +static struct toi_module_ops userui_ops; +static int orig_kmsg; + +static char lastheader[512]; +static int lastheader_message_len; +static int ui_helper_changed; /* Used at resume-time so don't overwrite value + set from initrd/ramfs. */ + +/* Number of distinct progress amounts that userspace can display */ +static int progress_granularity = 30; + +static DECLARE_WAIT_QUEUE_HEAD(userui_wait_for_key); +static int userui_wait_should_wake; + +#define toi_stop_waiting_for_userui_key() \ +{ \ + userui_wait_should_wake = true; \ + wake_up_interruptible(&userui_wait_for_key); \ +} + +/** + * ui_nl_set_state - Update toi_action based on a message from userui. + * + * @n: The bit (1 << bit) to set. + */ +static void ui_nl_set_state(int n) +{ + /* Only let them change certain settings */ + static const u32 toi_action_mask = + (1 << TOI_REBOOT) | (1 << TOI_PAUSE) | + (1 << TOI_LOGALL) | + (1 << TOI_SINGLESTEP) | + (1 << TOI_PAUSE_NEAR_PAGESET_END); + static unsigned long new_action; + + new_action = (toi_bkd.toi_action & (~toi_action_mask)) | + (n & toi_action_mask); + + printk(KERN_DEBUG "n is %x. Action flags being changed from %lx " + "to %lx.", n, toi_bkd.toi_action, new_action); + toi_bkd.toi_action = new_action; + + if (!test_action_state(TOI_PAUSE) && + !test_action_state(TOI_SINGLESTEP)) + toi_stop_waiting_for_userui_key(); +} + +/** + * userui_post_atomic_restore - Tell userui that atomic restore just happened. + * + * Tell userui that atomic restore just occured, so that it can do things like + * redrawing the screen, re-getting settings and so on. + */ +static void userui_post_atomic_restore(struct toi_boot_kernel_data *bkd) +{ + toi_send_netlink_message(&ui_helper_data, + USERUI_MSG_POST_ATOMIC_RESTORE, NULL, 0); +} + +/** + * userui_storage_needed - Report how much memory in image header is needed. + */ +static int userui_storage_needed(void) +{ + return sizeof(ui_helper_data.program) + 1 + sizeof(int); +} + +/** + * userui_save_config_info - Fill buffer with config info for image header. + * + * @buf: Buffer into which to put the config info we want to save. + */ +static int userui_save_config_info(char *buf) +{ + *((int *) buf) = progress_granularity; + memcpy(buf + sizeof(int), ui_helper_data.program, + sizeof(ui_helper_data.program)); + return sizeof(ui_helper_data.program) + sizeof(int) + 1; +} + +/** + * userui_load_config_info - Restore config info from buffer. + * + * @buf: Buffer containing header info loaded. + * @size: Size of data loaded for this module. + */ +static void userui_load_config_info(char *buf, int size) +{ + progress_granularity = *((int *) buf); + size -= sizeof(int); + + /* Don't load the saved path if one has already been set */ + if (ui_helper_changed) + return; + + if (size > sizeof(ui_helper_data.program)) + size = sizeof(ui_helper_data.program); + + memcpy(ui_helper_data.program, buf + sizeof(int), size); + ui_helper_data.program[sizeof(ui_helper_data.program)-1] = '\0'; +} + +/** + * set_ui_program_set: Record that userui program was changed. + * + * Side effect routine for when the userui program is set. In an initrd or + * ramfs, the user may set a location for the userui program. If this happens, + * we don't want to reload the value that was saved in the image header. This + * routine allows us to flag that we shouldn't restore the program name from + * the image header. + */ +static void set_ui_program_set(void) +{ + ui_helper_changed = 1; +} + +/** + * userui_memory_needed - Tell core how much memory to reserve for us. + */ +static int userui_memory_needed(void) +{ + /* ball park figure of 128 pages */ + return 128 * PAGE_SIZE; +} + +/** + * userui_update_status - Update the progress bar and (if on) in-bar message. + * + * @value: Current progress percentage numerator. + * @maximum: Current progress percentage denominator. + * @fmt: Message to be displayed in the middle of the progress bar. + * + * Note that a NULL message does not mean that any previous message is erased! + * For that, you need toi_prepare_status with clearbar on. + * + * Returns an unsigned long, being the next numerator (as determined by the + * maximum and progress granularity) where status needs to be updated. + * This is to reduce unnecessary calls to update_status. + */ +static u32 userui_update_status(u32 value, u32 maximum, const char *fmt, ...) +{ + static u32 last_step = 9999; + struct userui_msg_params msg; + u32 this_step, next_update; + int bitshift; + + if (ui_helper_data.pid == -1) + return 0; + + if ((!maximum) || (!progress_granularity)) + return maximum; + + if (value < 0) + value = 0; + + if (value > maximum) + value = maximum; + + /* Try to avoid math problems - we can't do 64 bit math here + * (and shouldn't need it - anyone got screen resolution + * of 65536 pixels or more?) */ + bitshift = fls(maximum) - 16; + if (bitshift > 0) { + u32 temp_maximum = maximum >> bitshift; + u32 temp_value = value >> bitshift; + this_step = (u32) + (temp_value * progress_granularity / temp_maximum); + next_update = (((this_step + 1) * temp_maximum / + progress_granularity) + 1) << bitshift; + } else { + this_step = (u32) (value * progress_granularity / maximum); + next_update = ((this_step + 1) * maximum / + progress_granularity) + 1; + } + + if (this_step == last_step) + return next_update; + + memset(&msg, 0, sizeof(msg)); + + msg.a = this_step; + msg.b = progress_granularity; + + if (fmt) { + va_list args; + va_start(args, fmt); + vsnprintf(msg.text, sizeof(msg.text), fmt, args); + va_end(args); + msg.text[sizeof(msg.text)-1] = '\0'; + } + + toi_send_netlink_message(&ui_helper_data, USERUI_MSG_PROGRESS, + &msg, sizeof(msg)); + last_step = this_step; + + return next_update; +} + +/** + * userui_message - Display a message without necessarily logging it. + * + * @section: Type of message. Messages can be filtered by type. + * @level: Degree of importance of the message. Lower values = higher priority. + * @normally_logged: Whether logged even if log_everything is off. + * @fmt: Message (and parameters). + * + * This function is intended to do the same job as printk, but without normally + * logging what is printed. The point is to be able to get debugging info on + * screen without filling the logs with "1/534. ^M 2/534^M. 3/534^M" + * + * It may be called from an interrupt context - can't sleep! + */ +static void userui_message(u32 section, u32 level, u32 normally_logged, + const char *fmt, ...) +{ + struct userui_msg_params msg; + + if ((level) && (level > console_loglevel)) + return; + + memset(&msg, 0, sizeof(msg)); + + msg.a = section; + msg.b = level; + msg.c = normally_logged; + + if (fmt) { + va_list args; + va_start(args, fmt); + vsnprintf(msg.text, sizeof(msg.text), fmt, args); + va_end(args); + msg.text[sizeof(msg.text)-1] = '\0'; + } + + if (test_action_state(TOI_LOGALL)) + printk(KERN_INFO "%s\n", msg.text); + + toi_send_netlink_message(&ui_helper_data, USERUI_MSG_MESSAGE, + &msg, sizeof(msg)); +} + +/** + * wait_for_key_via_userui - Wait for userui to receive a keypress. + */ +static void wait_for_key_via_userui(void) +{ + DECLARE_WAITQUEUE(wait, current); + + add_wait_queue(&userui_wait_for_key, &wait); + set_current_state(TASK_INTERRUPTIBLE); + + wait_event_interruptible(userui_wait_for_key, userui_wait_should_wake); + userui_wait_should_wake = false; + + set_current_state(TASK_RUNNING); + remove_wait_queue(&userui_wait_for_key, &wait); +} + +/** + * userui_prepare_status - Display high level messages. + * + * @clearbar: Whether to clear the progress bar. + * @fmt...: New message for the title. + * + * Prepare the 'nice display', drawing the header and version, along with the + * current action and perhaps also resetting the progress bar. + */ +static void userui_prepare_status(int clearbar, const char *fmt, ...) +{ + va_list args; + + if (fmt) { + va_start(args, fmt); + lastheader_message_len = vsnprintf(lastheader, 512, fmt, args); + va_end(args); + } + + if (clearbar) + toi_update_status(0, 1, NULL); + + if (ui_helper_data.pid == -1) + printk(KERN_EMERG "%s\n", lastheader); + else + toi_message(0, TOI_STATUS, 1, lastheader, NULL); +} + +/** + * toi_wait_for_keypress - Wait for keypress via userui. + * + * @timeout: Maximum time to wait. + * + * Wait for a keypress from userui. + * + * FIXME: Implement timeout? + */ +static char userui_wait_for_keypress(int timeout) +{ + char key = '\0'; + + if (ui_helper_data.pid != -1) { + wait_for_key_via_userui(); + key = ' '; + } + + return key; +} + +/** + * userui_abort_hibernate - Abort a cycle & tell user if they didn't request it. + * + * @result_code: Reason why we're aborting (1 << bit). + * @fmt: Message to display if telling the user what's going on. + * + * Abort a cycle. If this wasn't at the user's request (and we're displaying + * output), tell the user why and wait for them to acknowledge the message. + */ +static void userui_abort_hibernate(int result_code, const char *fmt, ...) +{ + va_list args; + int printed_len = 0; + + set_result_state(result_code); + + if (test_result_state(TOI_ABORTED)) + return; + + set_result_state(TOI_ABORTED); + + if (test_result_state(TOI_ABORT_REQUESTED)) + return; + + va_start(args, fmt); + printed_len = vsnprintf(local_printf_buf, sizeof(local_printf_buf), + fmt, args); + va_end(args); + if (ui_helper_data.pid != -1) + printed_len = sprintf(local_printf_buf + printed_len, + " (Press SPACE to continue)"); + + toi_prepare_status(CLEAR_BAR, "%s", local_printf_buf); + + if (ui_helper_data.pid != -1) + userui_wait_for_keypress(0); +} + +/** + * request_abort_hibernate - Abort hibernating or resuming at user request. + * + * Handle the user requesting the cancellation of a hibernation or resume by + * pressing escape. + */ +static void request_abort_hibernate(void) +{ + if (test_result_state(TOI_ABORT_REQUESTED) || + !test_action_state(TOI_CAN_CANCEL)) + return; + + if (test_toi_state(TOI_NOW_RESUMING)) { + toi_prepare_status(CLEAR_BAR, "Escape pressed. " + "Powering down again."); + set_toi_state(TOI_STOP_RESUME); + while (!test_toi_state(TOI_IO_STOPPED)) + schedule(); + if (toiActiveAllocator->mark_resume_attempted) + toiActiveAllocator->mark_resume_attempted(0); + toi_power_down(); + } + + toi_prepare_status(CLEAR_BAR, "--- ESCAPE PRESSED :" + " ABORTING HIBERNATION ---"); + set_abort_result(TOI_ABORT_REQUESTED); + toi_stop_waiting_for_userui_key(); +} + +/** + * userui_user_rcv_msg - Receive a netlink message from userui. + * + * @skb: skb received. + * @nlh: Netlink header received. + */ +static int userui_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) +{ + int type; + int *data; + + type = nlh->nlmsg_type; + + /* A control message: ignore them */ + if (type < NETLINK_MSG_BASE) + return 0; + + /* Unknown message: reply with EINVAL */ + if (type >= USERUI_MSG_MAX) + return -EINVAL; + + /* All operations require privileges, even GET */ + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + /* Only allow one task to receive NOFREEZE privileges */ + if (type == NETLINK_MSG_NOFREEZE_ME && ui_helper_data.pid != -1) { + printk(KERN_INFO "Got NOFREEZE_ME request when " + "ui_helper_data.pid is %d.\n", ui_helper_data.pid); + return -EBUSY; + } + + data = (int *) NLMSG_DATA(nlh); + + switch (type) { + case USERUI_MSG_ABORT: + request_abort_hibernate(); + return 0; + case USERUI_MSG_GET_STATE: + toi_send_netlink_message(&ui_helper_data, + USERUI_MSG_GET_STATE, &toi_bkd.toi_action, + sizeof(toi_bkd.toi_action)); + return 0; + case USERUI_MSG_GET_DEBUG_STATE: + toi_send_netlink_message(&ui_helper_data, + USERUI_MSG_GET_DEBUG_STATE, + &toi_bkd.toi_debug_state, + sizeof(toi_bkd.toi_debug_state)); + return 0; + case USERUI_MSG_SET_STATE: + if (nlh->nlmsg_len < NLMSG_LENGTH(sizeof(int))) + return -EINVAL; + ui_nl_set_state(*data); + return 0; + case USERUI_MSG_SET_DEBUG_STATE: + if (nlh->nlmsg_len < NLMSG_LENGTH(sizeof(int))) + return -EINVAL; + toi_bkd.toi_debug_state = (*data); + return 0; + case USERUI_MSG_SPACE: + toi_stop_waiting_for_userui_key(); + return 0; + case USERUI_MSG_GET_POWERDOWN_METHOD: + toi_send_netlink_message(&ui_helper_data, + USERUI_MSG_GET_POWERDOWN_METHOD, + &toi_poweroff_method, + sizeof(toi_poweroff_method)); + return 0; + case USERUI_MSG_SET_POWERDOWN_METHOD: + if (nlh->nlmsg_len != NLMSG_LENGTH(sizeof(char))) + return -EINVAL; + toi_poweroff_method = (unsigned long)(*data); + return 0; + case USERUI_MSG_GET_LOGLEVEL: + toi_send_netlink_message(&ui_helper_data, + USERUI_MSG_GET_LOGLEVEL, + &toi_bkd.toi_default_console_level, + sizeof(toi_bkd.toi_default_console_level)); + return 0; + case USERUI_MSG_SET_LOGLEVEL: + if (nlh->nlmsg_len < NLMSG_LENGTH(sizeof(int))) + return -EINVAL; + toi_bkd.toi_default_console_level = (*data); + return 0; + case USERUI_MSG_PRINTK: + printk(KERN_INFO "%s", (char *) data); + return 0; + } + + /* Unhandled here */ + return 1; +} + +/** + * userui_cond_pause - Possibly pause at user request. + * + * @pause: Whether to pause or just display the message. + * @message: Message to display at the start of pausing. + * + * Potentially pause and wait for the user to tell us to continue. We normally + * only pause when @pause is set. While paused, the user can do things like + * changing the loglevel, toggling the display of debugging sections and such + * like. + */ +static void userui_cond_pause(int pause, char *message) +{ + int displayed_message = 0, last_key = 0; + + while (last_key != 32 && + ui_helper_data.pid != -1 && + ((test_action_state(TOI_PAUSE) && pause) || + (test_action_state(TOI_SINGLESTEP)))) { + if (!displayed_message) { + toi_prepare_status(DONT_CLEAR_BAR, + "%s Press SPACE to continue.%s", + message ? message : "", + (test_action_state(TOI_SINGLESTEP)) ? + " Single step on." : ""); + displayed_message = 1; + } + last_key = userui_wait_for_keypress(0); + } + schedule(); +} + +/** + * userui_prepare_console - Prepare the console for use. + * + * Prepare a console for use, saving current kmsg settings and attempting to + * start userui. Console loglevel changes are handled by userui. + */ +static void userui_prepare_console(void) +{ + orig_kmsg = vt_kmsg_redirect(fg_console + 1); + + ui_helper_data.pid = -1; + + if (!userui_ops.enabled) { + printk(KERN_INFO "TuxOnIce: Userui disabled.\n"); + return; + } + + if (*ui_helper_data.program) + toi_netlink_setup(&ui_helper_data); + else + printk(KERN_INFO "TuxOnIce: Userui program not configured.\n"); +} + +/** + * userui_cleanup_console - Cleanup after a cycle. + * + * Tell userui to cleanup, and restore kmsg_redirect to its original value. + */ + +static void userui_cleanup_console(void) +{ + if (ui_helper_data.pid > -1) + toi_netlink_close(&ui_helper_data); + + vt_kmsg_redirect(orig_kmsg); +} + +/* + * User interface specific /sys/power/tuxonice entries. + */ + +static struct toi_sysfs_data sysfs_params[] = { +#if defined(CONFIG_NET) && defined(CONFIG_SYSFS) + SYSFS_BIT("enable_escape", SYSFS_RW, &toi_bkd.toi_action, + TOI_CAN_CANCEL, 0), + SYSFS_BIT("pause_between_steps", SYSFS_RW, &toi_bkd.toi_action, + TOI_PAUSE, 0), + SYSFS_INT("enabled", SYSFS_RW, &userui_ops.enabled, 0, 1, 0, NULL), + SYSFS_INT("progress_granularity", SYSFS_RW, &progress_granularity, 1, + 2048, 0, NULL), + SYSFS_STRING("program", SYSFS_RW, ui_helper_data.program, 255, 0, + set_ui_program_set), + SYSFS_INT("debug", SYSFS_RW, &ui_helper_data.debug, 0, 1, 0, NULL) +#endif +}; + +static struct toi_module_ops userui_ops = { + .type = MISC_MODULE, + .name = "userui", + .shared_directory = "user_interface", + .module = THIS_MODULE, + .storage_needed = userui_storage_needed, + .save_config_info = userui_save_config_info, + .load_config_info = userui_load_config_info, + .memory_needed = userui_memory_needed, + .post_atomic_restore = userui_post_atomic_restore, + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +static struct ui_ops my_ui_ops = { + .update_status = userui_update_status, + .message = userui_message, + .prepare_status = userui_prepare_status, + .abort = userui_abort_hibernate, + .cond_pause = userui_cond_pause, + .prepare = userui_prepare_console, + .cleanup = userui_cleanup_console, + .wait_for_key = userui_wait_for_keypress, +}; + +/** + * toi_user_ui_init - Boot time initialisation for user interface. + * + * Invoked from the core init routine. + */ +static __init int toi_user_ui_init(void) +{ + int result; + + ui_helper_data.nl = NULL; + strncpy(ui_helper_data.program, CONFIG_TOI_USERUI_DEFAULT_PATH, 255); + ui_helper_data.pid = -1; + ui_helper_data.skb_size = sizeof(struct userui_msg_params); + ui_helper_data.pool_limit = 6; + ui_helper_data.netlink_id = NETLINK_TOI_USERUI; + ui_helper_data.name = "userspace ui"; + ui_helper_data.rcv_msg = userui_user_rcv_msg; + ui_helper_data.interface_version = 8; + ui_helper_data.must_init = 0; + ui_helper_data.not_ready = userui_cleanup_console; + init_completion(&ui_helper_data.wait_for_process); + result = toi_register_module(&userui_ops); + if (!result) { + result = toi_register_ui_ops(&my_ui_ops); + if (result) + toi_unregister_module(&userui_ops); + } + + return result; +} + +late_initcall(toi_user_ui_init); diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index c963ba5..464c10a 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #include #include @@ -284,6 +285,20 @@ static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN); static char *log_buf = __log_buf; static u32 log_buf_len = __LOG_BUF_LEN; +#ifdef CONFIG_TOI_INCREMENTAL +void toi_set_logbuf_untracked(void) +{ + int i; + struct page *log_buf_start_page = virt_to_page(__log_buf); + + printk("Not protecting kernel printk log buffer (%p-%p).\n", + __log_buf, __log_buf + __LOG_BUF_LEN); + + for (i = 0; i < (1 << (CONFIG_LOG_BUF_SHIFT - PAGE_SHIFT)); i++) + SetPageTOI_Untracked(log_buf_start_page + i); +} +#endif + /* Return log buffer address */ char *log_buf_addr_get(void) { @@ -1455,9 +1470,9 @@ static void call_console_drivers(int level, !(con->flags & CON_ANYTIME)) continue; if (con->flags & CON_EXTENDED) - con->write(con, ext_text, ext_len); + con->write(con, ext_text, ext_len, level); else - con->write(con, text, len); + con->write(con, text, len, level); } } @@ -1978,7 +1993,7 @@ asmlinkage __visible void early_printk(const char *fmt, ...) n = vscnprintf(buf, sizeof(buf), fmt, ap); va_end(ap); - early_console->write(early_console, buf, n); + early_console->write(early_console, buf, n, 0); } #endif diff --git a/kernel/profile.c b/kernel/profile.c index 99513e1..5136969 100644 --- a/kernel/profile.c +++ b/kernel/profile.c @@ -59,6 +59,7 @@ int profile_setup(char *str) if (!strncmp(str, sleepstr, strlen(sleepstr))) { #ifdef CONFIG_SCHEDSTATS + force_schedstat_enabled(); prof_on = SLEEP_PROFILING; if (str[strlen(sleepstr)] == ',') str += strlen(sleepstr) + 1; diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e41dd41..9fd5b62 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1614,7 +1614,6 @@ static int rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp) int needmore; struct rcu_data *rdp = this_cpu_ptr(rsp->rda); - rcu_nocb_gp_cleanup(rsp, rnp); rnp->need_future_gp[c & 0x1] = 0; needmore = rnp->need_future_gp[(c + 1) & 0x1]; trace_rcu_future_gp(rnp, rdp, c, @@ -1635,7 +1634,7 @@ static void rcu_gp_kthread_wake(struct rcu_state *rsp) !READ_ONCE(rsp->gp_flags) || !rsp->gp_kthread) return; - wake_up(&rsp->gp_wq); + swake_up(&rsp->gp_wq); } /* @@ -2010,6 +2009,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp) int nocb = 0; struct rcu_data *rdp; struct rcu_node *rnp = rcu_get_root(rsp); + struct swait_queue_head *sq; WRITE_ONCE(rsp->gp_activity, jiffies); raw_spin_lock_irq_rcu_node(rnp); @@ -2046,7 +2046,9 @@ static void rcu_gp_cleanup(struct rcu_state *rsp) needgp = __note_gp_changes(rsp, rnp, rdp) || needgp; /* smp_mb() provided by prior unlock-lock pair. */ nocb += rcu_future_gp_cleanup(rsp, rnp); + sq = rcu_nocb_gp_get(rnp); raw_spin_unlock_irq(&rnp->lock); + rcu_nocb_gp_cleanup(sq); cond_resched_rcu_qs(); WRITE_ONCE(rsp->gp_activity, jiffies); rcu_gp_slow(rsp, gp_cleanup_delay); @@ -2092,7 +2094,7 @@ static int __noreturn rcu_gp_kthread(void *arg) READ_ONCE(rsp->gpnum), TPS("reqwait")); rsp->gp_state = RCU_GP_WAIT_GPS; - wait_event_interruptible(rsp->gp_wq, + swait_event_interruptible(rsp->gp_wq, READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_INIT); rsp->gp_state = RCU_GP_DONE_GPS; @@ -2122,7 +2124,7 @@ static int __noreturn rcu_gp_kthread(void *arg) READ_ONCE(rsp->gpnum), TPS("fqswait")); rsp->gp_state = RCU_GP_WAIT_FQS; - ret = wait_event_interruptible_timeout(rsp->gp_wq, + ret = swait_event_interruptible_timeout(rsp->gp_wq, rcu_gp_fqs_check_wake(rsp, &gf), j); rsp->gp_state = RCU_GP_DOING_FQS; /* Locking provides needed memory barriers. */ @@ -2246,7 +2248,7 @@ static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags) WARN_ON_ONCE(!rcu_gp_in_progress(rsp)); WRITE_ONCE(rsp->gp_flags, READ_ONCE(rsp->gp_flags) | RCU_GP_FLAG_FQS); raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); - rcu_gp_kthread_wake(rsp); + swake_up(&rsp->gp_wq); /* Memory barrier implied by swake_up() path. */ } /* @@ -2900,7 +2902,7 @@ static void force_quiescent_state(struct rcu_state *rsp) } WRITE_ONCE(rsp->gp_flags, READ_ONCE(rsp->gp_flags) | RCU_GP_FLAG_FQS); raw_spin_unlock_irqrestore(&rnp_old->lock, flags); - rcu_gp_kthread_wake(rsp); + swake_up(&rsp->gp_wq); /* Memory barrier implied by swake_up() path. */ } /* @@ -3529,7 +3531,7 @@ static void __rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp, raw_spin_unlock_irqrestore(&rnp->lock, flags); if (wake) { smp_mb(); /* EGP done before wake_up(). */ - wake_up(&rsp->expedited_wq); + swake_up(&rsp->expedited_wq); } break; } @@ -3780,7 +3782,7 @@ static void synchronize_sched_expedited_wait(struct rcu_state *rsp) jiffies_start = jiffies; for (;;) { - ret = wait_event_interruptible_timeout( + ret = swait_event_timeout( rsp->expedited_wq, sync_rcu_preempt_exp_done(rnp_root), jiffies_stall); @@ -3788,7 +3790,7 @@ static void synchronize_sched_expedited_wait(struct rcu_state *rsp) return; if (ret < 0) { /* Hit a signal, disable CPU stall warnings. */ - wait_event(rsp->expedited_wq, + swait_event(rsp->expedited_wq, sync_rcu_preempt_exp_done(rnp_root)); return; } @@ -4482,8 +4484,8 @@ static void __init rcu_init_one(struct rcu_state *rsp) } } - init_waitqueue_head(&rsp->gp_wq); - init_waitqueue_head(&rsp->expedited_wq); + init_swait_queue_head(&rsp->gp_wq); + init_swait_queue_head(&rsp->expedited_wq); rnp = rsp->level[rcu_num_lvls - 1]; for_each_possible_cpu(i) { while (i > rnp->grphi) diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 83360b4..bbd235d 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -27,6 +27,7 @@ #include #include #include +#include #include /* @@ -243,7 +244,7 @@ struct rcu_node { /* Refused to boost: not sure why, though. */ /* This can happen due to race conditions. */ #ifdef CONFIG_RCU_NOCB_CPU - wait_queue_head_t nocb_gp_wq[2]; + struct swait_queue_head nocb_gp_wq[2]; /* Place for rcu_nocb_kthread() to wait GP. */ #endif /* #ifdef CONFIG_RCU_NOCB_CPU */ int need_future_gp[2]; @@ -399,7 +400,7 @@ struct rcu_data { atomic_long_t nocb_q_count_lazy; /* invocation (all stages). */ struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */ struct rcu_head **nocb_follower_tail; - wait_queue_head_t nocb_wq; /* For nocb kthreads to sleep on. */ + struct swait_queue_head nocb_wq; /* For nocb kthreads to sleep on. */ struct task_struct *nocb_kthread; int nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */ @@ -478,7 +479,7 @@ struct rcu_state { unsigned long gpnum; /* Current gp number. */ unsigned long completed; /* # of last completed gp. */ struct task_struct *gp_kthread; /* Task for grace periods. */ - wait_queue_head_t gp_wq; /* Where GP task waits. */ + struct swait_queue_head gp_wq; /* Where GP task waits. */ short gp_flags; /* Commands for GP task. */ short gp_state; /* GP kthread sleep state. */ @@ -506,7 +507,7 @@ struct rcu_state { unsigned long expedited_sequence; /* Take a ticket. */ atomic_long_t expedited_normal; /* # fallbacks to normal. */ atomic_t expedited_need_qs; /* # CPUs left to check in. */ - wait_queue_head_t expedited_wq; /* Wait for check-ins. */ + struct swait_queue_head expedited_wq; /* Wait for check-ins. */ int ncpus_snap; /* # CPUs seen last time. */ unsigned long jiffies_force_qs; /* Time at which to invoke */ @@ -621,7 +622,8 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp); static void increment_cpu_stall_ticks(void); static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu); static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq); -static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp); +static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp); +static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq); static void rcu_init_one_nocb(struct rcu_node *rnp); static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp, bool lazy, unsigned long flags); diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 9467a8b..080bd20 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1811,9 +1811,9 @@ early_param("rcu_nocb_poll", parse_rcu_nocb_poll); * Wake up any no-CBs CPUs' kthreads that were waiting on the just-ended * grace period. */ -static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp) +static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq) { - wake_up_all(&rnp->nocb_gp_wq[rnp->completed & 0x1]); + swake_up_all(sq); } /* @@ -1829,10 +1829,15 @@ static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq) rnp->need_future_gp[(rnp->completed + 1) & 0x1] += nrq; } +static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp) +{ + return &rnp->nocb_gp_wq[rnp->completed & 0x1]; +} + static void rcu_init_one_nocb(struct rcu_node *rnp) { - init_waitqueue_head(&rnp->nocb_gp_wq[0]); - init_waitqueue_head(&rnp->nocb_gp_wq[1]); + init_swait_queue_head(&rnp->nocb_gp_wq[0]); + init_swait_queue_head(&rnp->nocb_gp_wq[1]); } #ifndef CONFIG_RCU_NOCB_CPU_ALL @@ -1857,7 +1862,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force) if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) { /* Prior smp_mb__after_atomic() orders against prior enqueue. */ WRITE_ONCE(rdp_leader->nocb_leader_sleep, false); - wake_up(&rdp_leader->nocb_wq); + swake_up(&rdp_leader->nocb_wq); } } @@ -2069,7 +2074,7 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp) */ trace_rcu_future_gp(rnp, rdp, c, TPS("StartWait")); for (;;) { - wait_event_interruptible( + swait_event_interruptible( rnp->nocb_gp_wq[c & 0x1], (d = ULONG_CMP_GE(READ_ONCE(rnp->completed), c))); if (likely(d)) @@ -2097,7 +2102,7 @@ wait_again: /* Wait for callbacks to appear. */ if (!rcu_nocb_poll) { trace_rcu_nocb_wake(my_rdp->rsp->name, my_rdp->cpu, "Sleep"); - wait_event_interruptible(my_rdp->nocb_wq, + swait_event_interruptible(my_rdp->nocb_wq, !READ_ONCE(my_rdp->nocb_leader_sleep)); /* Memory barrier handled by smp_mb() calls below and repoll. */ } else if (firsttime) { @@ -2172,7 +2177,7 @@ wait_again: * List was empty, wake up the follower. * Memory barriers supplied by atomic_long_add(). */ - wake_up(&rdp->nocb_wq); + swake_up(&rdp->nocb_wq); } } @@ -2193,7 +2198,7 @@ static void nocb_follower_wait(struct rcu_data *rdp) if (!rcu_nocb_poll) { trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "FollowerSleep"); - wait_event_interruptible(rdp->nocb_wq, + swait_event_interruptible(rdp->nocb_wq, READ_ONCE(rdp->nocb_follower_head)); } else if (firsttime) { /* Don't drown trace log with "Poll"! */ @@ -2352,7 +2357,7 @@ void __init rcu_init_nohz(void) static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) { rdp->nocb_tail = &rdp->nocb_head; - init_waitqueue_head(&rdp->nocb_wq); + init_swait_queue_head(&rdp->nocb_wq); rdp->nocb_follower_tail = &rdp->nocb_follower_head; } @@ -2502,7 +2507,7 @@ static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu) return false; } -static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp) +static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq) { } @@ -2510,6 +2515,11 @@ static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq) { } +static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp) +{ + return NULL; +} + static void rcu_init_one_nocb(struct rcu_node *rnp) { } diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 6768797..7d4cba2 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -13,7 +13,7 @@ endif obj-y += core.o loadavg.o clock.o cputime.o obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o -obj-y += wait.o completion.o idle.o +obj-y += wait.o swait.o completion.o idle.o obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 41f6b22..05114b1 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -67,12 +67,10 @@ #include #include #include -#include #include #include #include #include -#include #include #include @@ -125,138 +123,6 @@ const_debug unsigned int sysctl_sched_features = #undef SCHED_FEAT -#ifdef CONFIG_SCHED_DEBUG -#define SCHED_FEAT(name, enabled) \ - #name , - -static const char * const sched_feat_names[] = { -#include "features.h" -}; - -#undef SCHED_FEAT - -static int sched_feat_show(struct seq_file *m, void *v) -{ - int i; - - for (i = 0; i < __SCHED_FEAT_NR; i++) { - if (!(sysctl_sched_features & (1UL << i))) - seq_puts(m, "NO_"); - seq_printf(m, "%s ", sched_feat_names[i]); - } - seq_puts(m, "\n"); - - return 0; -} - -#ifdef HAVE_JUMP_LABEL - -#define jump_label_key__true STATIC_KEY_INIT_TRUE -#define jump_label_key__false STATIC_KEY_INIT_FALSE - -#define SCHED_FEAT(name, enabled) \ - jump_label_key__##enabled , - -struct static_key sched_feat_keys[__SCHED_FEAT_NR] = { -#include "features.h" -}; - -#undef SCHED_FEAT - -static void sched_feat_disable(int i) -{ - static_key_disable(&sched_feat_keys[i]); -} - -static void sched_feat_enable(int i) -{ - static_key_enable(&sched_feat_keys[i]); -} -#else -static void sched_feat_disable(int i) { }; -static void sched_feat_enable(int i) { }; -#endif /* HAVE_JUMP_LABEL */ - -static int sched_feat_set(char *cmp) -{ - int i; - int neg = 0; - - if (strncmp(cmp, "NO_", 3) == 0) { - neg = 1; - cmp += 3; - } - - for (i = 0; i < __SCHED_FEAT_NR; i++) { - if (strcmp(cmp, sched_feat_names[i]) == 0) { - if (neg) { - sysctl_sched_features &= ~(1UL << i); - sched_feat_disable(i); - } else { - sysctl_sched_features |= (1UL << i); - sched_feat_enable(i); - } - break; - } - } - - return i; -} - -static ssize_t -sched_feat_write(struct file *filp, const char __user *ubuf, - size_t cnt, loff_t *ppos) -{ - char buf[64]; - char *cmp; - int i; - struct inode *inode; - - if (cnt > 63) - cnt = 63; - - if (copy_from_user(&buf, ubuf, cnt)) - return -EFAULT; - - buf[cnt] = 0; - cmp = strstrip(buf); - - /* Ensure the static_key remains in a consistent state */ - inode = file_inode(filp); - inode_lock(inode); - i = sched_feat_set(cmp); - inode_unlock(inode); - if (i == __SCHED_FEAT_NR) - return -EINVAL; - - *ppos += cnt; - - return cnt; -} - -static int sched_feat_open(struct inode *inode, struct file *filp) -{ - return single_open(filp, sched_feat_show, NULL); -} - -static const struct file_operations sched_feat_fops = { - .open = sched_feat_open, - .write = sched_feat_write, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; - -static __init int sched_init_debug(void) -{ - debugfs_create_file("sched_features", 0644, NULL, NULL, - &sched_feat_fops); - - return 0; -} -late_initcall(sched_init_debug); -#endif /* CONFIG_SCHED_DEBUG */ - /* * Number of tasks to iterate in a single balance run. * Limited because this is done with IRQs disabled. @@ -2094,7 +1960,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) ttwu_queue(p, cpu); stat: - ttwu_stat(p, cpu, wake_flags); + if (schedstat_enabled()) + ttwu_stat(p, cpu, wake_flags); out: raw_spin_unlock_irqrestore(&p->pi_lock, flags); @@ -2142,7 +2009,8 @@ static void try_to_wake_up_local(struct task_struct *p) ttwu_activate(rq, p, ENQUEUE_WAKEUP); ttwu_do_wakeup(rq, p, 0); - ttwu_stat(p, smp_processor_id(), 0); + if (schedstat_enabled()) + ttwu_stat(p, smp_processor_id(), 0); out: raw_spin_unlock(&p->pi_lock); } @@ -2184,7 +2052,6 @@ void __dl_clear_params(struct task_struct *p) dl_se->dl_bw = 0; dl_se->dl_throttled = 0; - dl_se->dl_new = 1; dl_se->dl_yielded = 0; } @@ -2211,6 +2078,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) #endif #ifdef CONFIG_SCHEDSTATS + /* Even if schedstat is disabled, there should not be garbage */ memset(&p->se.statistics, 0, sizeof(p->se.statistics)); #endif @@ -2219,6 +2087,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) __dl_clear_params(p); INIT_LIST_HEAD(&p->rt.run_list); + p->rt.timeout = 0; + p->rt.time_slice = sched_rr_timeslice; + p->rt.on_rq = 0; + p->rt.on_list = 0; #ifdef CONFIG_PREEMPT_NOTIFIERS INIT_HLIST_HEAD(&p->preempt_notifiers); @@ -2282,6 +2154,69 @@ int sysctl_numa_balancing(struct ctl_table *table, int write, #endif #endif +DEFINE_STATIC_KEY_FALSE(sched_schedstats); + +#ifdef CONFIG_SCHEDSTATS +static void set_schedstats(bool enabled) +{ + if (enabled) + static_branch_enable(&sched_schedstats); + else + static_branch_disable(&sched_schedstats); +} + +void force_schedstat_enabled(void) +{ + if (!schedstat_enabled()) { + pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n"); + static_branch_enable(&sched_schedstats); + } +} + +static int __init setup_schedstats(char *str) +{ + int ret = 0; + if (!str) + goto out; + + if (!strcmp(str, "enable")) { + set_schedstats(true); + ret = 1; + } else if (!strcmp(str, "disable")) { + set_schedstats(false); + ret = 1; + } +out: + if (!ret) + pr_warn("Unable to parse schedstats=\n"); + + return ret; +} +__setup("schedstats=", setup_schedstats); + +#ifdef CONFIG_PROC_SYSCTL +int sysctl_schedstats(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + struct ctl_table t; + int err; + int state = static_branch_likely(&sched_schedstats); + + if (write && !capable(CAP_SYS_ADMIN)) + return -EPERM; + + t = *table; + t.data = &state; + err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos); + if (err < 0) + return err; + if (write) + set_schedstats(state); + return err; +} +#endif +#endif + /* * fork()/clone()-time setup: */ @@ -3011,16 +2946,6 @@ u64 scheduler_tick_max_deferment(void) } #endif -notrace unsigned long get_parent_ip(unsigned long addr) -{ - if (in_lock_functions(addr)) { - addr = CALLER_ADDR2; - if (in_lock_functions(addr)) - addr = CALLER_ADDR3; - } - return addr; -} - #if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \ defined(CONFIG_PREEMPT_TRACER)) @@ -3042,7 +2967,7 @@ void preempt_count_add(int val) PREEMPT_MASK - 10); #endif if (preempt_count() == val) { - unsigned long ip = get_parent_ip(CALLER_ADDR1); + unsigned long ip = get_lock_parent_ip(); #ifdef CONFIG_DEBUG_PREEMPT current->preempt_disable_ip = ip; #endif @@ -3069,7 +2994,7 @@ void preempt_count_sub(int val) #endif if (preempt_count() == val) - trace_preempt_on(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1)); + trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip()); __preempt_count_sub(val); } EXPORT_SYMBOL(preempt_count_sub); @@ -3281,7 +3206,6 @@ static void __sched notrace __schedule(bool preempt) trace_sched_switch(preempt, prev, next); rq = context_switch(rq, prev, next); /* unlocks the rq */ - cpu = cpu_of(rq); } else { lockdep_unpin_lock(&rq->lock); raw_spin_unlock_irq(&rq->lock); @@ -3467,7 +3391,7 @@ EXPORT_SYMBOL(default_wake_function); */ void rt_mutex_setprio(struct task_struct *p, int prio) { - int oldprio, queued, running, enqueue_flag = ENQUEUE_RESTORE; + int oldprio, queued, running, queue_flag = DEQUEUE_SAVE | DEQUEUE_MOVE; struct rq *rq; const struct sched_class *prev_class; @@ -3495,11 +3419,15 @@ void rt_mutex_setprio(struct task_struct *p, int prio) trace_sched_pi_setprio(p, prio); oldprio = p->prio; + + if (oldprio == prio) + queue_flag &= ~DEQUEUE_MOVE; + prev_class = p->sched_class; queued = task_on_rq_queued(p); running = task_current(rq, p); if (queued) - dequeue_task(rq, p, DEQUEUE_SAVE); + dequeue_task(rq, p, queue_flag); if (running) put_prev_task(rq, p); @@ -3517,7 +3445,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio) if (!dl_prio(p->normal_prio) || (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) { p->dl.dl_boosted = 1; - enqueue_flag |= ENQUEUE_REPLENISH; + queue_flag |= ENQUEUE_REPLENISH; } else p->dl.dl_boosted = 0; p->sched_class = &dl_sched_class; @@ -3525,7 +3453,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio) if (dl_prio(oldprio)) p->dl.dl_boosted = 0; if (oldprio < prio) - enqueue_flag |= ENQUEUE_HEAD; + queue_flag |= ENQUEUE_HEAD; p->sched_class = &rt_sched_class; } else { if (dl_prio(oldprio)) @@ -3540,7 +3468,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio) if (running) p->sched_class->set_curr_task(rq); if (queued) - enqueue_task(rq, p, enqueue_flag); + enqueue_task(rq, p, queue_flag); check_class_changed(rq, p, prev_class, oldprio); out_unlock: @@ -3896,6 +3824,7 @@ static int __sched_setscheduler(struct task_struct *p, const struct sched_class *prev_class; struct rq *rq; int reset_on_fork; + int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE; /* may grab non-irq protected spin_locks */ BUG_ON(in_interrupt()); @@ -4078,17 +4007,14 @@ change: * itself. */ new_effective_prio = rt_mutex_get_effective_prio(p, newprio); - if (new_effective_prio == oldprio) { - __setscheduler_params(p, attr); - task_rq_unlock(rq, p, &flags); - return 0; - } + if (new_effective_prio == oldprio) + queue_flags &= ~DEQUEUE_MOVE; } queued = task_on_rq_queued(p); running = task_current(rq, p); if (queued) - dequeue_task(rq, p, DEQUEUE_SAVE); + dequeue_task(rq, p, queue_flags); if (running) put_prev_task(rq, p); @@ -4098,15 +4024,14 @@ change: if (running) p->sched_class->set_curr_task(rq); if (queued) { - int enqueue_flags = ENQUEUE_RESTORE; /* * We enqueue to tail when the priority of a task is * increased (user space view). */ - if (oldprio <= p->prio) - enqueue_flags |= ENQUEUE_HEAD; + if (oldprio < p->prio) + queue_flags |= ENQUEUE_HEAD; - enqueue_task(rq, p, enqueue_flags); + enqueue_task(rq, p, queue_flags); } check_class_changed(rq, p, prev_class, oldprio); @@ -5408,183 +5333,6 @@ static void migrate_tasks(struct rq *dead_rq) } #endif /* CONFIG_HOTPLUG_CPU */ -#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL) - -static struct ctl_table sd_ctl_dir[] = { - { - .procname = "sched_domain", - .mode = 0555, - }, - {} -}; - -static struct ctl_table sd_ctl_root[] = { - { - .procname = "kernel", - .mode = 0555, - .child = sd_ctl_dir, - }, - {} -}; - -static struct ctl_table *sd_alloc_ctl_entry(int n) -{ - struct ctl_table *entry = - kcalloc(n, sizeof(struct ctl_table), GFP_KERNEL); - - return entry; -} - -static void sd_free_ctl_entry(struct ctl_table **tablep) -{ - struct ctl_table *entry; - - /* - * In the intermediate directories, both the child directory and - * procname are dynamically allocated and could fail but the mode - * will always be set. In the lowest directory the names are - * static strings and all have proc handlers. - */ - for (entry = *tablep; entry->mode; entry++) { - if (entry->child) - sd_free_ctl_entry(&entry->child); - if (entry->proc_handler == NULL) - kfree(entry->procname); - } - - kfree(*tablep); - *tablep = NULL; -} - -static int min_load_idx = 0; -static int max_load_idx = CPU_LOAD_IDX_MAX-1; - -static void -set_table_entry(struct ctl_table *entry, - const char *procname, void *data, int maxlen, - umode_t mode, proc_handler *proc_handler, - bool load_idx) -{ - entry->procname = procname; - entry->data = data; - entry->maxlen = maxlen; - entry->mode = mode; - entry->proc_handler = proc_handler; - - if (load_idx) { - entry->extra1 = &min_load_idx; - entry->extra2 = &max_load_idx; - } -} - -static struct ctl_table * -sd_alloc_ctl_domain_table(struct sched_domain *sd) -{ - struct ctl_table *table = sd_alloc_ctl_entry(14); - - if (table == NULL) - return NULL; - - set_table_entry(&table[0], "min_interval", &sd->min_interval, - sizeof(long), 0644, proc_doulongvec_minmax, false); - set_table_entry(&table[1], "max_interval", &sd->max_interval, - sizeof(long), 0644, proc_doulongvec_minmax, false); - set_table_entry(&table[2], "busy_idx", &sd->busy_idx, - sizeof(int), 0644, proc_dointvec_minmax, true); - set_table_entry(&table[3], "idle_idx", &sd->idle_idx, - sizeof(int), 0644, proc_dointvec_minmax, true); - set_table_entry(&table[4], "newidle_idx", &sd->newidle_idx, - sizeof(int), 0644, proc_dointvec_minmax, true); - set_table_entry(&table[5], "wake_idx", &sd->wake_idx, - sizeof(int), 0644, proc_dointvec_minmax, true); - set_table_entry(&table[6], "forkexec_idx", &sd->forkexec_idx, - sizeof(int), 0644, proc_dointvec_minmax, true); - set_table_entry(&table[7], "busy_factor", &sd->busy_factor, - sizeof(int), 0644, proc_dointvec_minmax, false); - set_table_entry(&table[8], "imbalance_pct", &sd->imbalance_pct, - sizeof(int), 0644, proc_dointvec_minmax, false); - set_table_entry(&table[9], "cache_nice_tries", - &sd->cache_nice_tries, - sizeof(int), 0644, proc_dointvec_minmax, false); - set_table_entry(&table[10], "flags", &sd->flags, - sizeof(int), 0644, proc_dointvec_minmax, false); - set_table_entry(&table[11], "max_newidle_lb_cost", - &sd->max_newidle_lb_cost, - sizeof(long), 0644, proc_doulongvec_minmax, false); - set_table_entry(&table[12], "name", sd->name, - CORENAME_MAX_SIZE, 0444, proc_dostring, false); - /* &table[13] is terminator */ - - return table; -} - -static struct ctl_table *sd_alloc_ctl_cpu_table(int cpu) -{ - struct ctl_table *entry, *table; - struct sched_domain *sd; - int domain_num = 0, i; - char buf[32]; - - for_each_domain(cpu, sd) - domain_num++; - entry = table = sd_alloc_ctl_entry(domain_num + 1); - if (table == NULL) - return NULL; - - i = 0; - for_each_domain(cpu, sd) { - snprintf(buf, 32, "domain%d", i); - entry->procname = kstrdup(buf, GFP_KERNEL); - entry->mode = 0555; - entry->child = sd_alloc_ctl_domain_table(sd); - entry++; - i++; - } - return table; -} - -static struct ctl_table_header *sd_sysctl_header; -static void register_sched_domain_sysctl(void) -{ - int i, cpu_num = num_possible_cpus(); - struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1); - char buf[32]; - - WARN_ON(sd_ctl_dir[0].child); - sd_ctl_dir[0].child = entry; - - if (entry == NULL) - return; - - for_each_possible_cpu(i) { - snprintf(buf, 32, "cpu%d", i); - entry->procname = kstrdup(buf, GFP_KERNEL); - entry->mode = 0555; - entry->child = sd_alloc_ctl_cpu_table(i); - entry++; - } - - WARN_ON(sd_sysctl_header); - sd_sysctl_header = register_sysctl_table(sd_ctl_root); -} - -/* may be called multiple times per register */ -static void unregister_sched_domain_sysctl(void) -{ - unregister_sysctl_table(sd_sysctl_header); - sd_sysctl_header = NULL; - if (sd_ctl_dir[0].child) - sd_free_ctl_entry(&sd_ctl_dir[0].child); -} -#else -static void register_sched_domain_sysctl(void) -{ -} -static void unregister_sched_domain_sysctl(void) -{ -} -#endif /* CONFIG_SCHED_DEBUG && CONFIG_SYSCTL */ - static void set_rq_online(struct rq *rq) { if (!rq->online) { @@ -6176,11 +5924,16 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) /* Setup the mask of cpus configured for isolated domains */ static int __init isolated_cpu_setup(char *str) { + int ret; + alloc_bootmem_cpumask_var(&cpu_isolated_map); - cpulist_parse(str, cpu_isolated_map); + ret = cpulist_parse(str, cpu_isolated_map); + if (ret) { + pr_err("sched: Error, all isolcpus= values must be between 0 and %d\n", nr_cpu_ids); + return 0; + } return 1; } - __setup("isolcpus=", isolated_cpu_setup); struct s_data { @@ -7863,11 +7616,9 @@ void sched_destroy_group(struct task_group *tg) void sched_offline_group(struct task_group *tg) { unsigned long flags; - int i; /* end participation in shares distribution */ - for_each_possible_cpu(i) - unregister_fair_sched_group(tg, i); + unregister_fair_sched_group(tg); spin_lock_irqsave(&task_group_lock, flags); list_del_rcu(&tg->list); @@ -7893,7 +7644,7 @@ void sched_move_task(struct task_struct *tsk) queued = task_on_rq_queued(tsk); if (queued) - dequeue_task(rq, tsk, DEQUEUE_SAVE); + dequeue_task(rq, tsk, DEQUEUE_SAVE | DEQUEUE_MOVE); if (unlikely(running)) put_prev_task(rq, tsk); @@ -7917,7 +7668,7 @@ void sched_move_task(struct task_struct *tsk) if (unlikely(running)) tsk->sched_class->set_curr_task(rq); if (queued) - enqueue_task(rq, tsk, ENQUEUE_RESTORE); + enqueue_task(rq, tsk, ENQUEUE_RESTORE | ENQUEUE_MOVE); task_rq_unlock(rq, tsk, &flags); } diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index b2ab2ff..75f98c5 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -262,21 +262,21 @@ static __always_inline bool steal_account_process_tick(void) #ifdef CONFIG_PARAVIRT if (static_key_false(¶virt_steal_enabled)) { u64 steal; - cputime_t steal_ct; + unsigned long steal_jiffies; steal = paravirt_steal_clock(smp_processor_id()); steal -= this_rq()->prev_steal_time; /* - * cputime_t may be less precise than nsecs (eg: if it's - * based on jiffies). Lets cast the result to cputime + * steal is in nsecs but our caller is expecting steal + * time in jiffies. Lets cast the result to jiffies * granularity and account the rest on the next rounds. */ - steal_ct = nsecs_to_cputime(steal); - this_rq()->prev_steal_time += cputime_to_nsecs(steal_ct); + steal_jiffies = nsecs_to_jiffies(steal); + this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies); - account_steal_time(steal_ct); - return steal_ct; + account_steal_time(jiffies_to_cputime(steal_jiffies)); + return steal_jiffies; } #endif return false; @@ -668,26 +668,25 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */ #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN -static unsigned long long vtime_delta(struct task_struct *tsk) +static cputime_t vtime_delta(struct task_struct *tsk) { - unsigned long long clock; + unsigned long now = READ_ONCE(jiffies); - clock = local_clock(); - if (clock < tsk->vtime_snap) + if (time_before(now, (unsigned long)tsk->vtime_snap)) return 0; - return clock - tsk->vtime_snap; + return jiffies_to_cputime(now - tsk->vtime_snap); } static cputime_t get_vtime_delta(struct task_struct *tsk) { - unsigned long long delta = vtime_delta(tsk); + unsigned long now = READ_ONCE(jiffies); + unsigned long delta = now - tsk->vtime_snap; WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE); - tsk->vtime_snap += delta; + tsk->vtime_snap = now; - /* CHECKME: always safe to convert nsecs to cputime? */ - return nsecs_to_cputime(delta); + return jiffies_to_cputime(delta); } static void __vtime_account_system(struct task_struct *tsk) @@ -699,6 +698,9 @@ static void __vtime_account_system(struct task_struct *tsk) void vtime_account_system(struct task_struct *tsk) { + if (!vtime_delta(tsk)) + return; + write_seqcount_begin(&tsk->vtime_seqcount); __vtime_account_system(tsk); write_seqcount_end(&tsk->vtime_seqcount); @@ -707,7 +709,8 @@ void vtime_account_system(struct task_struct *tsk) void vtime_gen_account_irq_exit(struct task_struct *tsk) { write_seqcount_begin(&tsk->vtime_seqcount); - __vtime_account_system(tsk); + if (vtime_delta(tsk)) + __vtime_account_system(tsk); if (context_tracking_in_user()) tsk->vtime_snap_whence = VTIME_USER; write_seqcount_end(&tsk->vtime_seqcount); @@ -718,16 +721,19 @@ void vtime_account_user(struct task_struct *tsk) cputime_t delta_cpu; write_seqcount_begin(&tsk->vtime_seqcount); - delta_cpu = get_vtime_delta(tsk); tsk->vtime_snap_whence = VTIME_SYS; - account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu)); + if (vtime_delta(tsk)) { + delta_cpu = get_vtime_delta(tsk); + account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu)); + } write_seqcount_end(&tsk->vtime_seqcount); } void vtime_user_enter(struct task_struct *tsk) { write_seqcount_begin(&tsk->vtime_seqcount); - __vtime_account_system(tsk); + if (vtime_delta(tsk)) + __vtime_account_system(tsk); tsk->vtime_snap_whence = VTIME_USER; write_seqcount_end(&tsk->vtime_seqcount); } @@ -742,7 +748,8 @@ void vtime_guest_enter(struct task_struct *tsk) * that can thus safely catch up with a tickless delta. */ write_seqcount_begin(&tsk->vtime_seqcount); - __vtime_account_system(tsk); + if (vtime_delta(tsk)) + __vtime_account_system(tsk); current->flags |= PF_VCPU; write_seqcount_end(&tsk->vtime_seqcount); } @@ -772,7 +779,7 @@ void arch_vtime_task_switch(struct task_struct *prev) write_seqcount_begin(¤t->vtime_seqcount); current->vtime_snap_whence = VTIME_SYS; - current->vtime_snap = sched_clock_cpu(smp_processor_id()); + current->vtime_snap = jiffies; write_seqcount_end(¤t->vtime_seqcount); } @@ -783,7 +790,7 @@ void vtime_init_idle(struct task_struct *t, int cpu) local_irq_save(flags); write_seqcount_begin(&t->vtime_seqcount); t->vtime_snap_whence = VTIME_SYS; - t->vtime_snap = sched_clock_cpu(cpu); + t->vtime_snap = jiffies; write_seqcount_end(&t->vtime_seqcount); local_irq_restore(flags); } diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 57b939c..c7a036f 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -352,7 +352,15 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq = dl_rq_of_se(dl_se); struct rq *rq = rq_of_dl_rq(dl_rq); - WARN_ON(!dl_se->dl_new || dl_se->dl_throttled); + WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline)); + + /* + * We are racing with the deadline timer. So, do nothing because + * the deadline timer handler will take care of properly recharging + * the runtime and postponing the deadline + */ + if (dl_se->dl_throttled) + return; /* * We use the regular wall clock time to set deadlines in the @@ -361,7 +369,6 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se, */ dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline; dl_se->runtime = pi_se->dl_runtime; - dl_se->dl_new = 0; } /* @@ -399,6 +406,9 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se, dl_se->runtime = pi_se->dl_runtime; } + if (dl_se->dl_yielded && dl_se->runtime > 0) + dl_se->runtime = 0; + /* * We keep moving the deadline away until we get some * available runtime for the entity. This ensures correct @@ -500,15 +510,6 @@ static void update_dl_entity(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq = dl_rq_of_se(dl_se); struct rq *rq = rq_of_dl_rq(dl_rq); - /* - * The arrival of a new instance needs special treatment, i.e., - * the actual scheduling parameters have to be "renewed". - */ - if (dl_se->dl_new) { - setup_new_dl_entity(dl_se, pi_se); - return; - } - if (dl_time_before(dl_se->deadline, rq_clock(rq)) || dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) { dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline; @@ -605,16 +606,6 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer) } /* - * This is possible if switched_from_dl() raced against a running - * callback that took the above !dl_task() path and we've since then - * switched back into SCHED_DEADLINE. - * - * There's nothing to do except drop our task reference. - */ - if (dl_se->dl_new) - goto unlock; - - /* * The task might have been boosted by someone else and might be in the * boosting/deboosting path, its not throttled. */ @@ -735,8 +726,11 @@ static void update_curr_dl(struct rq *rq) * approach need further study. */ delta_exec = rq_clock_task(rq) - curr->se.exec_start; - if (unlikely((s64)delta_exec <= 0)) + if (unlikely((s64)delta_exec <= 0)) { + if (unlikely(dl_se->dl_yielded)) + goto throttle; return; + } schedstat_set(curr->se.statistics.exec_max, max(curr->se.statistics.exec_max, delta_exec)); @@ -749,8 +743,10 @@ static void update_curr_dl(struct rq *rq) sched_rt_avg_update(rq, delta_exec); - dl_se->runtime -= dl_se->dl_yielded ? 0 : delta_exec; - if (dl_runtime_exceeded(dl_se)) { + dl_se->runtime -= delta_exec; + +throttle: + if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) { dl_se->dl_throttled = 1; __dequeue_task_dl(rq, curr, 0); if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr))) @@ -917,7 +913,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, * parameters of the task might need updating. Otherwise, * we want a replenishment of its runtime. */ - if (dl_se->dl_new || flags & ENQUEUE_WAKEUP) + if (flags & ENQUEUE_WAKEUP) update_dl_entity(dl_se, pi_se); else if (flags & ENQUEUE_REPLENISH) replenish_dl_entity(dl_se, pi_se); @@ -994,18 +990,14 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags) */ static void yield_task_dl(struct rq *rq) { - struct task_struct *p = rq->curr; - /* * We make the task go to sleep until its current deadline by * forcing its runtime to zero. This way, update_curr_dl() stops * it and the bandwidth timer will wake it up and will give it * new scheduling parameters (thanks to dl_yielded=1). */ - if (p->dl.runtime > 0) { - rq->curr->dl.dl_yielded = 1; - p->dl.runtime = 0; - } + rq->curr->dl.dl_yielded = 1; + update_rq_clock(rq); update_curr_dl(rq); /* @@ -1722,6 +1714,9 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) */ static void switched_to_dl(struct rq *rq, struct task_struct *p) { + if (dl_time_before(p->dl.deadline, rq_clock(rq))) + setup_new_dl_entity(&p->dl, &p->dl); + if (task_on_rq_queued(p) && rq->curr != p) { #ifdef CONFIG_SMP if (p->nr_cpus_allowed > 1 && rq->dl.overloaded) @@ -1768,8 +1763,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p, */ resched_curr(rq); #endif /* CONFIG_SMP */ - } else - switched_to_dl(rq, p); + } } const struct sched_class dl_sched_class = { diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 6415117..4fbc3bd 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -16,6 +16,7 @@ #include #include #include +#include #include "sched.h" @@ -58,6 +59,309 @@ static unsigned long nsec_low(unsigned long long nsec) #define SPLIT_NS(x) nsec_high(x), nsec_low(x) +#define SCHED_FEAT(name, enabled) \ + #name , + +static const char * const sched_feat_names[] = { +#include "features.h" +}; + +#undef SCHED_FEAT + +static int sched_feat_show(struct seq_file *m, void *v) +{ + int i; + + for (i = 0; i < __SCHED_FEAT_NR; i++) { + if (!(sysctl_sched_features & (1UL << i))) + seq_puts(m, "NO_"); + seq_printf(m, "%s ", sched_feat_names[i]); + } + seq_puts(m, "\n"); + + return 0; +} + +#ifdef HAVE_JUMP_LABEL + +#define jump_label_key__true STATIC_KEY_INIT_TRUE +#define jump_label_key__false STATIC_KEY_INIT_FALSE + +#define SCHED_FEAT(name, enabled) \ + jump_label_key__##enabled , + +struct static_key sched_feat_keys[__SCHED_FEAT_NR] = { +#include "features.h" +}; + +#undef SCHED_FEAT + +static void sched_feat_disable(int i) +{ + static_key_disable(&sched_feat_keys[i]); +} + +static void sched_feat_enable(int i) +{ + static_key_enable(&sched_feat_keys[i]); +} +#else +static void sched_feat_disable(int i) { }; +static void sched_feat_enable(int i) { }; +#endif /* HAVE_JUMP_LABEL */ + +static int sched_feat_set(char *cmp) +{ + int i; + int neg = 0; + + if (strncmp(cmp, "NO_", 3) == 0) { + neg = 1; + cmp += 3; + } + + for (i = 0; i < __SCHED_FEAT_NR; i++) { + if (strcmp(cmp, sched_feat_names[i]) == 0) { + if (neg) { + sysctl_sched_features &= ~(1UL << i); + sched_feat_disable(i); + } else { + sysctl_sched_features |= (1UL << i); + sched_feat_enable(i); + } + break; + } + } + + return i; +} + +static ssize_t +sched_feat_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + char buf[64]; + char *cmp; + int i; + struct inode *inode; + + if (cnt > 63) + cnt = 63; + + if (copy_from_user(&buf, ubuf, cnt)) + return -EFAULT; + + buf[cnt] = 0; + cmp = strstrip(buf); + + /* Ensure the static_key remains in a consistent state */ + inode = file_inode(filp); + inode_lock(inode); + i = sched_feat_set(cmp); + inode_unlock(inode); + if (i == __SCHED_FEAT_NR) + return -EINVAL; + + *ppos += cnt; + + return cnt; +} + +static int sched_feat_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, sched_feat_show, NULL); +} + +static const struct file_operations sched_feat_fops = { + .open = sched_feat_open, + .write = sched_feat_write, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static __init int sched_init_debug(void) +{ + debugfs_create_file("sched_features", 0644, NULL, NULL, + &sched_feat_fops); + + return 0; +} +late_initcall(sched_init_debug); + +#ifdef CONFIG_SMP + +#ifdef CONFIG_SYSCTL + +static struct ctl_table sd_ctl_dir[] = { + { + .procname = "sched_domain", + .mode = 0555, + }, + {} +}; + +static struct ctl_table sd_ctl_root[] = { + { + .procname = "kernel", + .mode = 0555, + .child = sd_ctl_dir, + }, + {} +}; + +static struct ctl_table *sd_alloc_ctl_entry(int n) +{ + struct ctl_table *entry = + kcalloc(n, sizeof(struct ctl_table), GFP_KERNEL); + + return entry; +} + +static void sd_free_ctl_entry(struct ctl_table **tablep) +{ + struct ctl_table *entry; + + /* + * In the intermediate directories, both the child directory and + * procname are dynamically allocated and could fail but the mode + * will always be set. In the lowest directory the names are + * static strings and all have proc handlers. + */ + for (entry = *tablep; entry->mode; entry++) { + if (entry->child) + sd_free_ctl_entry(&entry->child); + if (entry->proc_handler == NULL) + kfree(entry->procname); + } + + kfree(*tablep); + *tablep = NULL; +} + +static int min_load_idx = 0; +static int max_load_idx = CPU_LOAD_IDX_MAX-1; + +static void +set_table_entry(struct ctl_table *entry, + const char *procname, void *data, int maxlen, + umode_t mode, proc_handler *proc_handler, + bool load_idx) +{ + entry->procname = procname; + entry->data = data; + entry->maxlen = maxlen; + entry->mode = mode; + entry->proc_handler = proc_handler; + + if (load_idx) { + entry->extra1 = &min_load_idx; + entry->extra2 = &max_load_idx; + } +} + +static struct ctl_table * +sd_alloc_ctl_domain_table(struct sched_domain *sd) +{ + struct ctl_table *table = sd_alloc_ctl_entry(14); + + if (table == NULL) + return NULL; + + set_table_entry(&table[0], "min_interval", &sd->min_interval, + sizeof(long), 0644, proc_doulongvec_minmax, false); + set_table_entry(&table[1], "max_interval", &sd->max_interval, + sizeof(long), 0644, proc_doulongvec_minmax, false); + set_table_entry(&table[2], "busy_idx", &sd->busy_idx, + sizeof(int), 0644, proc_dointvec_minmax, true); + set_table_entry(&table[3], "idle_idx", &sd->idle_idx, + sizeof(int), 0644, proc_dointvec_minmax, true); + set_table_entry(&table[4], "newidle_idx", &sd->newidle_idx, + sizeof(int), 0644, proc_dointvec_minmax, true); + set_table_entry(&table[5], "wake_idx", &sd->wake_idx, + sizeof(int), 0644, proc_dointvec_minmax, true); + set_table_entry(&table[6], "forkexec_idx", &sd->forkexec_idx, + sizeof(int), 0644, proc_dointvec_minmax, true); + set_table_entry(&table[7], "busy_factor", &sd->busy_factor, + sizeof(int), 0644, proc_dointvec_minmax, false); + set_table_entry(&table[8], "imbalance_pct", &sd->imbalance_pct, + sizeof(int), 0644, proc_dointvec_minmax, false); + set_table_entry(&table[9], "cache_nice_tries", + &sd->cache_nice_tries, + sizeof(int), 0644, proc_dointvec_minmax, false); + set_table_entry(&table[10], "flags", &sd->flags, + sizeof(int), 0644, proc_dointvec_minmax, false); + set_table_entry(&table[11], "max_newidle_lb_cost", + &sd->max_newidle_lb_cost, + sizeof(long), 0644, proc_doulongvec_minmax, false); + set_table_entry(&table[12], "name", sd->name, + CORENAME_MAX_SIZE, 0444, proc_dostring, false); + /* &table[13] is terminator */ + + return table; +} + +static struct ctl_table *sd_alloc_ctl_cpu_table(int cpu) +{ + struct ctl_table *entry, *table; + struct sched_domain *sd; + int domain_num = 0, i; + char buf[32]; + + for_each_domain(cpu, sd) + domain_num++; + entry = table = sd_alloc_ctl_entry(domain_num + 1); + if (table == NULL) + return NULL; + + i = 0; + for_each_domain(cpu, sd) { + snprintf(buf, 32, "domain%d", i); + entry->procname = kstrdup(buf, GFP_KERNEL); + entry->mode = 0555; + entry->child = sd_alloc_ctl_domain_table(sd); + entry++; + i++; + } + return table; +} + +static struct ctl_table_header *sd_sysctl_header; +void register_sched_domain_sysctl(void) +{ + int i, cpu_num = num_possible_cpus(); + struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1); + char buf[32]; + + WARN_ON(sd_ctl_dir[0].child); + sd_ctl_dir[0].child = entry; + + if (entry == NULL) + return; + + for_each_possible_cpu(i) { + snprintf(buf, 32, "cpu%d", i); + entry->procname = kstrdup(buf, GFP_KERNEL); + entry->mode = 0555; + entry->child = sd_alloc_ctl_cpu_table(i); + entry++; + } + + WARN_ON(sd_sysctl_header); + sd_sysctl_header = register_sysctl_table(sd_ctl_root); +} + +/* may be called multiple times per register */ +void unregister_sched_domain_sysctl(void) +{ + unregister_sysctl_table(sd_sysctl_header); + sd_sysctl_header = NULL; + if (sd_ctl_dir[0].child) + sd_free_ctl_entry(&sd_ctl_dir[0].child); +} +#endif /* CONFIG_SYSCTL */ +#endif /* CONFIG_SMP */ + #ifdef CONFIG_FAIR_GROUP_SCHED static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group *tg) { @@ -75,16 +379,18 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group PN(se->vruntime); PN(se->sum_exec_runtime); #ifdef CONFIG_SCHEDSTATS - PN(se->statistics.wait_start); - PN(se->statistics.sleep_start); - PN(se->statistics.block_start); - PN(se->statistics.sleep_max); - PN(se->statistics.block_max); - PN(se->statistics.exec_max); - PN(se->statistics.slice_max); - PN(se->statistics.wait_max); - PN(se->statistics.wait_sum); - P(se->statistics.wait_count); + if (schedstat_enabled()) { + PN(se->statistics.wait_start); + PN(se->statistics.sleep_start); + PN(se->statistics.block_start); + PN(se->statistics.sleep_max); + PN(se->statistics.block_max); + PN(se->statistics.exec_max); + PN(se->statistics.slice_max); + PN(se->statistics.wait_max); + PN(se->statistics.wait_sum); + P(se->statistics.wait_count); + } #endif P(se->load.weight); #ifdef CONFIG_SMP @@ -122,10 +428,12 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p) (long long)(p->nvcsw + p->nivcsw), p->prio); #ifdef CONFIG_SCHEDSTATS - SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld", - SPLIT_NS(p->se.statistics.wait_sum), - SPLIT_NS(p->se.sum_exec_runtime), - SPLIT_NS(p->se.statistics.sum_sleep_runtime)); + if (schedstat_enabled()) { + SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld", + SPLIT_NS(p->se.statistics.wait_sum), + SPLIT_NS(p->se.sum_exec_runtime), + SPLIT_NS(p->se.statistics.sum_sleep_runtime)); + } #else SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld", 0LL, 0L, @@ -258,8 +566,17 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq) void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq) { + struct dl_bw *dl_bw; + SEQ_printf(m, "\ndl_rq[%d]:\n", cpu); SEQ_printf(m, " .%-30s: %ld\n", "dl_nr_running", dl_rq->dl_nr_running); +#ifdef CONFIG_SMP + dl_bw = &cpu_rq(cpu)->rd->dl_bw; +#else + dl_bw = &dl_rq->dl_bw; +#endif + SEQ_printf(m, " .%-30s: %lld\n", "dl_bw->bw", dl_bw->bw); + SEQ_printf(m, " .%-30s: %lld\n", "dl_bw->total_bw", dl_bw->total_bw); } extern __read_mostly int sched_clock_running; @@ -313,17 +630,18 @@ do { \ #define P(n) SEQ_printf(m, " .%-30s: %d\n", #n, rq->n); #define P64(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, rq->n); - P(yld_count); - - P(sched_count); - P(sched_goidle); #ifdef CONFIG_SMP P64(avg_idle); P64(max_idle_balance_cost); #endif - P(ttwu_count); - P(ttwu_local); + if (schedstat_enabled()) { + P(yld_count); + P(sched_count); + P(sched_goidle); + P(ttwu_count); + P(ttwu_local); + } #undef P #undef P64 @@ -569,38 +887,39 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m) nr_switches = p->nvcsw + p->nivcsw; #ifdef CONFIG_SCHEDSTATS - PN(se.statistics.sum_sleep_runtime); - PN(se.statistics.wait_start); - PN(se.statistics.sleep_start); - PN(se.statistics.block_start); - PN(se.statistics.sleep_max); - PN(se.statistics.block_max); - PN(se.statistics.exec_max); - PN(se.statistics.slice_max); - PN(se.statistics.wait_max); - PN(se.statistics.wait_sum); - P(se.statistics.wait_count); - PN(se.statistics.iowait_sum); - P(se.statistics.iowait_count); P(se.nr_migrations); - P(se.statistics.nr_migrations_cold); - P(se.statistics.nr_failed_migrations_affine); - P(se.statistics.nr_failed_migrations_running); - P(se.statistics.nr_failed_migrations_hot); - P(se.statistics.nr_forced_migrations); - P(se.statistics.nr_wakeups); - P(se.statistics.nr_wakeups_sync); - P(se.statistics.nr_wakeups_migrate); - P(se.statistics.nr_wakeups_local); - P(se.statistics.nr_wakeups_remote); - P(se.statistics.nr_wakeups_affine); - P(se.statistics.nr_wakeups_affine_attempts); - P(se.statistics.nr_wakeups_passive); - P(se.statistics.nr_wakeups_idle); - { + if (schedstat_enabled()) { u64 avg_atom, avg_per_cpu; + PN(se.statistics.sum_sleep_runtime); + PN(se.statistics.wait_start); + PN(se.statistics.sleep_start); + PN(se.statistics.block_start); + PN(se.statistics.sleep_max); + PN(se.statistics.block_max); + PN(se.statistics.exec_max); + PN(se.statistics.slice_max); + PN(se.statistics.wait_max); + PN(se.statistics.wait_sum); + P(se.statistics.wait_count); + PN(se.statistics.iowait_sum); + P(se.statistics.iowait_count); + P(se.statistics.nr_migrations_cold); + P(se.statistics.nr_failed_migrations_affine); + P(se.statistics.nr_failed_migrations_running); + P(se.statistics.nr_failed_migrations_hot); + P(se.statistics.nr_forced_migrations); + P(se.statistics.nr_wakeups); + P(se.statistics.nr_wakeups_sync); + P(se.statistics.nr_wakeups_migrate); + P(se.statistics.nr_wakeups_local); + P(se.statistics.nr_wakeups_remote); + P(se.statistics.nr_wakeups_affine); + P(se.statistics.nr_wakeups_affine_attempts); + P(se.statistics.nr_wakeups_passive); + P(se.statistics.nr_wakeups_idle); + avg_atom = p->se.sum_exec_runtime; if (nr_switches) avg_atom = div64_ul(avg_atom, nr_switches); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 56b7d4b..ac7fb39 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -20,8 +20,8 @@ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra */ -#include #include +#include #include #include #include @@ -47,8 +47,13 @@ * (to see the precise effective timeslice length of your workload, * run vmstat and monitor the context-switches (cs) field) */ +#ifdef CONFIG_PCK_INTERACTIVE +unsigned int sysctl_sched_latency = 3000000ULL; +unsigned int normalized_sysctl_sched_latency = 3000000ULL; +#else unsigned int sysctl_sched_latency = 6000000ULL; unsigned int normalized_sysctl_sched_latency = 6000000ULL; +#endif /* * The initial- and re-scaling of tunables is configurable @@ -66,13 +71,22 @@ enum sched_tunable_scaling sysctl_sched_tunable_scaling * Minimal preemption granularity for CPU-bound tasks: * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds) */ +#ifdef CONFIG_PCK_INTERACTIVE +unsigned int sysctl_sched_min_granularity = 300000ULL; +unsigned int normalized_sysctl_sched_min_granularity = 300000ULL; +#else unsigned int sysctl_sched_min_granularity = 750000ULL; unsigned int normalized_sysctl_sched_min_granularity = 750000ULL; +#endif /* * is kept at sysctl_sched_latency / sysctl_sched_min_granularity */ +#ifdef CONFIG_PCK_INTERACTIVE +static unsigned int sched_nr_latency = 10; +#else static unsigned int sched_nr_latency = 8; +#endif /* * After fork, child runs first. If set to 0 (default) then @@ -88,10 +102,17 @@ unsigned int sysctl_sched_child_runs_first __read_mostly; * and reduces their over-scheduling. Synchronous workloads will still * have immediate wakeup/sleep latencies. */ +#ifdef CONFIG_PCK_INTERACTIVE +unsigned int sysctl_sched_wakeup_granularity = 500000UL; +unsigned int normalized_sysctl_sched_wakeup_granularity = 500000UL; + +const_debug unsigned int sysctl_sched_migration_cost = 250000UL; +#else unsigned int sysctl_sched_wakeup_granularity = 1000000UL; unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL; const_debug unsigned int sysctl_sched_migration_cost = 500000UL; +#endif /* * The exponential sliding window over which load is averaged for shares @@ -111,8 +132,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL; * * default: 5 msec, units: microseconds */ +#ifdef CONFIG_PCK_INTERACTIVE +unsigned int sysctl_sched_cfs_bandwidth_slice = 3000UL; +#else unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL; #endif +#endif static inline void update_load_add(struct load_weight *lw, unsigned long inc) { @@ -755,7 +780,9 @@ static void update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se) { struct task_struct *p; - u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start; + u64 delta; + + delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start; if (entity_is_task(se)) { p = task_of(se); @@ -776,22 +803,12 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se) se->statistics.wait_sum += delta; se->statistics.wait_start = 0; } -#else -static inline void -update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se) -{ -} - -static inline void -update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se) -{ -} -#endif /* * Task is being enqueued - update stats: */ -static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se) +static inline void +update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se) { /* * Are we enqueueing a waiting task? (for current tasks @@ -802,7 +819,7 @@ static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se) } static inline void -update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) +update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { /* * Mark the end of the wait period if dequeueing a @@ -810,8 +827,41 @@ update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) */ if (se != cfs_rq->curr) update_stats_wait_end(cfs_rq, se); + + if (flags & DEQUEUE_SLEEP) { + if (entity_is_task(se)) { + struct task_struct *tsk = task_of(se); + + if (tsk->state & TASK_INTERRUPTIBLE) + se->statistics.sleep_start = rq_clock(rq_of(cfs_rq)); + if (tsk->state & TASK_UNINTERRUPTIBLE) + se->statistics.block_start = rq_clock(rq_of(cfs_rq)); + } + } + +} +#else +static inline void +update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se) +{ } +static inline void +update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se) +{ +} + +static inline void +update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se) +{ +} + +static inline void +update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) +{ +} +#endif + /* * We are picking a new current task - update its stats: */ @@ -907,10 +957,11 @@ struct numa_group { spinlock_t lock; /* nr_tasks, tasks */ int nr_tasks; pid_t gid; + int active_nodes; struct rcu_head rcu; - nodemask_t active_nodes; unsigned long total_faults; + unsigned long max_faults_cpu; /* * Faults_cpu is used to decide whether memory should move * towards the CPU. As a consequence, these stats are weighted @@ -969,6 +1020,18 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid) group->faults_cpu[task_faults_idx(NUMA_MEM, nid, 1)]; } +/* + * A node triggering more than 1/3 as many NUMA faults as the maximum is + * considered part of a numa group's pseudo-interleaving set. Migrations + * between these nodes are slowed down, to allow things to settle down. + */ +#define ACTIVE_NODE_FRACTION 3 + +static bool numa_is_active_node(int nid, struct numa_group *ng) +{ + return group_faults_cpu(ng, nid) * ACTIVE_NODE_FRACTION > ng->max_faults_cpu; +} + /* Handle placement on systems where not all nodes are directly connected. */ static unsigned long score_nearby_nodes(struct task_struct *p, int nid, int maxdist, bool task) @@ -1118,27 +1181,23 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, return true; /* - * Do not migrate if the destination is not a node that - * is actively used by this numa group. - */ - if (!node_isset(dst_nid, ng->active_nodes)) - return false; - - /* - * Source is a node that is not actively used by this - * numa group, while the destination is. Migrate. + * Destination node is much more heavily used than the source + * node? Allow migration. */ - if (!node_isset(src_nid, ng->active_nodes)) + if (group_faults_cpu(ng, dst_nid) > group_faults_cpu(ng, src_nid) * + ACTIVE_NODE_FRACTION) return true; /* - * Both source and destination are nodes in active - * use by this numa group. Maximize memory bandwidth - * by migrating from more heavily used groups, to less - * heavily used ones, spreading the load around. - * Use a 1/4 hysteresis to avoid spurious page movement. + * Distribute memory according to CPU & memory use on each node, + * with 3/4 hysteresis to avoid unnecessary memory migrations: + * + * faults_cpu(dst) 3 faults_cpu(src) + * --------------- * - > --------------- + * faults_mem(dst) 4 faults_mem(src) */ - return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4); + return group_faults_cpu(ng, dst_nid) * group_faults(p, src_nid) * 3 > + group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4; } static unsigned long weighted_cpuload(const int cpu); @@ -1484,7 +1543,7 @@ static int task_numa_migrate(struct task_struct *p) .best_task = NULL, .best_imp = 0, - .best_cpu = -1 + .best_cpu = -1, }; struct sched_domain *sd; unsigned long taskweight, groupweight; @@ -1536,8 +1595,7 @@ static int task_numa_migrate(struct task_struct *p) * multiple NUMA nodes; in order to better consolidate the group, * we need to check other locations. */ - if (env.best_cpu == -1 || (p->numa_group && - nodes_weight(p->numa_group->active_nodes) > 1)) { + if (env.best_cpu == -1 || (p->numa_group && p->numa_group->active_nodes > 1)) { for_each_online_node(nid) { if (nid == env.src_nid || nid == p->numa_preferred_nid) continue; @@ -1572,12 +1630,14 @@ static int task_numa_migrate(struct task_struct *p) * trying for a better one later. Do not set the preferred node here. */ if (p->numa_group) { + struct numa_group *ng = p->numa_group; + if (env.best_cpu == -1) nid = env.src_nid; else nid = env.dst_nid; - if (node_isset(nid, p->numa_group->active_nodes)) + if (ng->active_nodes > 1 && numa_is_active_node(env.dst_nid, ng)) sched_setnuma(p, env.dst_nid); } @@ -1627,20 +1687,15 @@ static void numa_migrate_preferred(struct task_struct *p) } /* - * Find the nodes on which the workload is actively running. We do this by + * Find out how many nodes on the workload is actively running on. Do this by * tracking the nodes from which NUMA hinting faults are triggered. This can * be different from the set of nodes where the workload's memory is currently * located. - * - * The bitmask is used to make smarter decisions on when to do NUMA page - * migrations, To prevent flip-flopping, and excessive page migrations, nodes - * are added when they cause over 6/16 of the maximum number of faults, but - * only removed when they drop below 3/16. */ -static void update_numa_active_node_mask(struct numa_group *numa_group) +static void numa_group_count_active_nodes(struct numa_group *numa_group) { unsigned long faults, max_faults = 0; - int nid; + int nid, active_nodes = 0; for_each_online_node(nid) { faults = group_faults_cpu(numa_group, nid); @@ -1650,12 +1705,12 @@ static void update_numa_active_node_mask(struct numa_group *numa_group) for_each_online_node(nid) { faults = group_faults_cpu(numa_group, nid); - if (!node_isset(nid, numa_group->active_nodes)) { - if (faults > max_faults * 6 / 16) - node_set(nid, numa_group->active_nodes); - } else if (faults < max_faults * 3 / 16) - node_clear(nid, numa_group->active_nodes); + if (faults * ACTIVE_NODE_FRACTION > max_faults) + active_nodes++; } + + numa_group->max_faults_cpu = max_faults; + numa_group->active_nodes = active_nodes; } /* @@ -1946,7 +2001,7 @@ static void task_numa_placement(struct task_struct *p) update_task_scan_period(p, fault_types[0], fault_types[1]); if (p->numa_group) { - update_numa_active_node_mask(p->numa_group); + numa_group_count_active_nodes(p->numa_group); spin_unlock_irq(group_lock); max_nid = preferred_group_nid(p, max_group_nid); } @@ -1990,14 +2045,14 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, return; atomic_set(&grp->refcount, 1); + grp->active_nodes = 1; + grp->max_faults_cpu = 0; spin_lock_init(&grp->lock); grp->gid = p->pid; /* Second half of the array tracks nids where faults happen */ grp->faults_cpu = grp->faults + NR_NUMA_HINT_FAULT_TYPES * nr_node_ids; - node_set(task_node(current), grp->active_nodes); - for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) grp->faults[i] = p->numa_faults[i]; @@ -2111,6 +2166,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) bool migrated = flags & TNF_MIGRATED; int cpu_node = task_node(current); int local = !!(flags & TNF_FAULT_LOCAL); + struct numa_group *ng; int priv; if (!static_branch_likely(&sched_numa_balancing)) @@ -2151,9 +2207,10 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) * actively using should be counted as local. This allows the * scan rate to slow down when a workload has settled down. */ - if (!priv && !local && p->numa_group && - node_isset(cpu_node, p->numa_group->active_nodes) && - node_isset(mem_node, p->numa_group->active_nodes)) + ng = p->numa_group; + if (!priv && !local && ng && ng->active_nodes > 1 && + numa_is_active_node(cpu_node, ng) && + numa_is_active_node(mem_node, ng)) local = 1; task_numa_placement(p); @@ -3102,6 +3159,26 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial) static void check_enqueue_throttle(struct cfs_rq *cfs_rq); +static inline void check_schedstat_required(void) +{ +#ifdef CONFIG_SCHEDSTATS + if (schedstat_enabled()) + return; + + /* Force schedstat enabled if a dependent tracepoint is active */ + if (trace_sched_stat_wait_enabled() || + trace_sched_stat_sleep_enabled() || + trace_sched_stat_iowait_enabled() || + trace_sched_stat_blocked_enabled() || + trace_sched_stat_runtime_enabled()) { + pr_warn_once("Scheduler tracepoints stat_sleep, stat_iowait, " + "stat_blocked and stat_runtime require the " + "kernel parameter schedstats=enabled or " + "kernel.sched_schedstats=1\n"); + } +#endif +} + static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { @@ -3122,11 +3199,15 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) if (flags & ENQUEUE_WAKEUP) { place_entity(cfs_rq, se, 0); - enqueue_sleeper(cfs_rq, se); + if (schedstat_enabled()) + enqueue_sleeper(cfs_rq, se); } - update_stats_enqueue(cfs_rq, se); - check_spread(cfs_rq, se); + check_schedstat_required(); + if (schedstat_enabled()) { + update_stats_enqueue(cfs_rq, se); + check_spread(cfs_rq, se); + } if (se != cfs_rq->curr) __enqueue_entity(cfs_rq, se); se->on_rq = 1; @@ -3193,19 +3274,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) update_curr(cfs_rq); dequeue_entity_load_avg(cfs_rq, se); - update_stats_dequeue(cfs_rq, se); - if (flags & DEQUEUE_SLEEP) { -#ifdef CONFIG_SCHEDSTATS - if (entity_is_task(se)) { - struct task_struct *tsk = task_of(se); - - if (tsk->state & TASK_INTERRUPTIBLE) - se->statistics.sleep_start = rq_clock(rq_of(cfs_rq)); - if (tsk->state & TASK_UNINTERRUPTIBLE) - se->statistics.block_start = rq_clock(rq_of(cfs_rq)); - } -#endif - } + if (schedstat_enabled()) + update_stats_dequeue(cfs_rq, se, flags); clear_buddies(cfs_rq, se); @@ -3279,7 +3349,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) * a CPU. So account for the time it spent waiting on the * runqueue. */ - update_stats_wait_end(cfs_rq, se); + if (schedstat_enabled()) + update_stats_wait_end(cfs_rq, se); __dequeue_entity(cfs_rq, se); update_load_avg(se, 1); } @@ -3292,7 +3363,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) * least twice that of our own weight (i.e. dont track it * when there are only lesser-weight tasks around): */ - if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) { + if (schedstat_enabled() && rq_of(cfs_rq)->load.weight >= 2*se->load.weight) { se->statistics.slice_max = max(se->statistics.slice_max, se->sum_exec_runtime - se->prev_sum_exec_runtime); } @@ -3375,9 +3446,13 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev) /* throttle cfs_rqs exceeding runtime */ check_cfs_rq_runtime(cfs_rq); - check_spread(cfs_rq, prev); + if (schedstat_enabled()) { + check_spread(cfs_rq, prev); + if (prev->on_rq) + update_stats_wait_start(cfs_rq, prev); + } + if (prev->on_rq) { - update_stats_wait_start(cfs_rq, prev); /* Put 'current' back into the tree. */ __enqueue_entity(cfs_rq, prev); /* in !on_rq case, update occurred at dequeue */ @@ -4459,9 +4534,17 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load, /* scale is effectively 1 << i now, and >> i divides by scale */ - old_load = this_rq->cpu_load[i] - tickless_load; + old_load = this_rq->cpu_load[i]; old_load = decay_load_missed(old_load, pending_updates - 1, i); - old_load += tickless_load; + if (tickless_load) { + old_load -= decay_load_missed(tickless_load, pending_updates - 1, i); + /* + * old_load can never be a negative value because a + * decayed tickless_load cannot be greater than the + * original tickless_load. + */ + old_load += tickless_load; + } new_load = this_load; /* * Round up the averaging division if load is increasing. This @@ -4484,6 +4567,25 @@ static unsigned long weighted_cpuload(const int cpu) } #ifdef CONFIG_NO_HZ_COMMON +static void __update_cpu_load_nohz(struct rq *this_rq, + unsigned long curr_jiffies, + unsigned long load, + int active) +{ + unsigned long pending_updates; + + pending_updates = curr_jiffies - this_rq->last_load_update_tick; + if (pending_updates) { + this_rq->last_load_update_tick = curr_jiffies; + /* + * In the regular NOHZ case, we were idle, this means load 0. + * In the NOHZ_FULL case, we were non-idle, we should consider + * its weighted load. + */ + __update_cpu_load(this_rq, load, pending_updates, active); + } +} + /* * There is no sane way to deal with nohz on smp when using jiffies because the * cpu doing the jiffies update might drift wrt the cpu doing the jiffy reading @@ -4501,22 +4603,15 @@ static unsigned long weighted_cpuload(const int cpu) * Called from nohz_idle_balance() to update the load ratings before doing the * idle balance. */ -static void update_idle_cpu_load(struct rq *this_rq) +static void update_cpu_load_idle(struct rq *this_rq) { - unsigned long curr_jiffies = READ_ONCE(jiffies); - unsigned long load = weighted_cpuload(cpu_of(this_rq)); - unsigned long pending_updates; - /* * bail if there's load or we're actually up-to-date. */ - if (load || curr_jiffies == this_rq->last_load_update_tick) + if (weighted_cpuload(cpu_of(this_rq))) return; - pending_updates = curr_jiffies - this_rq->last_load_update_tick; - this_rq->last_load_update_tick = curr_jiffies; - - __update_cpu_load(this_rq, load, pending_updates, 0); + __update_cpu_load_nohz(this_rq, READ_ONCE(jiffies), 0, 0); } /* @@ -4527,22 +4622,12 @@ void update_cpu_load_nohz(int active) struct rq *this_rq = this_rq(); unsigned long curr_jiffies = READ_ONCE(jiffies); unsigned long load = active ? weighted_cpuload(cpu_of(this_rq)) : 0; - unsigned long pending_updates; if (curr_jiffies == this_rq->last_load_update_tick) return; raw_spin_lock(&this_rq->lock); - pending_updates = curr_jiffies - this_rq->last_load_update_tick; - if (pending_updates) { - this_rq->last_load_update_tick = curr_jiffies; - /* - * In the regular NOHZ case, we were idle, this means load 0. - * In the NOHZ_FULL case, we were non-idle, we should consider - * its weighted load. - */ - __update_cpu_load(this_rq, load, pending_updates, active); - } + __update_cpu_load_nohz(this_rq, curr_jiffies, load, active); raw_spin_unlock(&this_rq->lock); } #endif /* CONFIG_NO_HZ */ @@ -4554,7 +4639,7 @@ void update_cpu_load_active(struct rq *this_rq) { unsigned long load = weighted_cpuload(cpu_of(this_rq)); /* - * See the mess around update_idle_cpu_load() / update_cpu_load_nohz(). + * See the mess around update_cpu_load_idle() / update_cpu_load_nohz(). */ this_rq->last_load_update_tick = jiffies; __update_cpu_load(this_rq, load, 1, 1); @@ -7848,7 +7933,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) if (time_after_eq(jiffies, rq->next_balance)) { raw_spin_lock_irq(&rq->lock); update_rq_clock(rq); - update_idle_cpu_load(rq); + update_cpu_load_idle(rq); raw_spin_unlock_irq(&rq->lock); rebalance_domains(rq, CPU_IDLE); } @@ -8234,11 +8319,8 @@ void free_fair_sched_group(struct task_group *tg) for_each_possible_cpu(i) { if (tg->cfs_rq) kfree(tg->cfs_rq[i]); - if (tg->se) { - if (tg->se[i]) - remove_entity_load_avg(tg->se[i]); + if (tg->se) kfree(tg->se[i]); - } } kfree(tg->cfs_rq); @@ -8286,21 +8368,29 @@ err: return 0; } -void unregister_fair_sched_group(struct task_group *tg, int cpu) +void unregister_fair_sched_group(struct task_group *tg) { - struct rq *rq = cpu_rq(cpu); unsigned long flags; + struct rq *rq; + int cpu; - /* - * Only empty task groups can be destroyed; so we can speculatively - * check on_list without danger of it being re-added. - */ - if (!tg->cfs_rq[cpu]->on_list) - return; + for_each_possible_cpu(cpu) { + if (tg->se[cpu]) + remove_entity_load_avg(tg->se[cpu]); - raw_spin_lock_irqsave(&rq->lock, flags); - list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); - raw_spin_unlock_irqrestore(&rq->lock, flags); + /* + * Only empty task groups can be destroyed; so we can speculatively + * check on_list without danger of it being re-added. + */ + if (!tg->cfs_rq[cpu]->on_list) + continue; + + rq = cpu_rq(cpu); + + raw_spin_lock_irqsave(&rq->lock, flags); + list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); + raw_spin_unlock_irqrestore(&rq->lock, flags); + } } void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq, @@ -8382,7 +8472,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) return 1; } -void unregister_fair_sched_group(struct task_group *tg, int cpu) { } +void unregister_fair_sched_group(struct task_group *tg) { } #endif /* CONFIG_FAIR_GROUP_SCHED */ diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 8ec86ab..a774b4d 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -58,7 +58,15 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b) raw_spin_lock(&rt_b->rt_runtime_lock); if (!rt_b->rt_period_active) { rt_b->rt_period_active = 1; - hrtimer_forward_now(&rt_b->rt_period_timer, rt_b->rt_period); + /* + * SCHED_DEADLINE updates the bandwidth, as a run away + * RT task with a DL task could hog a CPU. But DL does + * not reset the period. If a deadline task was running + * without an RT task running, it can cause RT tasks to + * throttle when they start up. Kick the timer right away + * to update the period. + */ + hrtimer_forward_now(&rt_b->rt_period_timer, ns_to_ktime(0)); hrtimer_start_expires(&rt_b->rt_period_timer, HRTIMER_MODE_ABS_PINNED); } raw_spin_unlock(&rt_b->rt_runtime_lock); @@ -436,7 +444,7 @@ static void dequeue_top_rt_rq(struct rt_rq *rt_rq); static inline int on_rt_rq(struct sched_rt_entity *rt_se) { - return !list_empty(&rt_se->run_list); + return rt_se->on_rq; } #ifdef CONFIG_RT_GROUP_SCHED @@ -482,8 +490,8 @@ static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se) return rt_se->my_q; } -static void enqueue_rt_entity(struct sched_rt_entity *rt_se, bool head); -static void dequeue_rt_entity(struct sched_rt_entity *rt_se); +static void enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags); +static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags); static void sched_rt_rq_enqueue(struct rt_rq *rt_rq) { @@ -499,7 +507,7 @@ static void sched_rt_rq_enqueue(struct rt_rq *rt_rq) if (!rt_se) enqueue_top_rt_rq(rt_rq); else if (!on_rt_rq(rt_se)) - enqueue_rt_entity(rt_se, false); + enqueue_rt_entity(rt_se, 0); if (rt_rq->highest_prio.curr < curr->prio) resched_curr(rq); @@ -516,7 +524,7 @@ static void sched_rt_rq_dequeue(struct rt_rq *rt_rq) if (!rt_se) dequeue_top_rt_rq(rt_rq); else if (on_rt_rq(rt_se)) - dequeue_rt_entity(rt_se); + dequeue_rt_entity(rt_se, 0); } static inline int rt_rq_throttled(struct rt_rq *rt_rq) @@ -1166,7 +1174,30 @@ void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) dec_rt_group(rt_se, rt_rq); } -static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, bool head) +/* + * Change rt_se->run_list location unless SAVE && !MOVE + * + * assumes ENQUEUE/DEQUEUE flags match + */ +static inline bool move_entity(unsigned int flags) +{ + if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) == DEQUEUE_SAVE) + return false; + + return true; +} + +static void __delist_rt_entity(struct sched_rt_entity *rt_se, struct rt_prio_array *array) +{ + list_del_init(&rt_se->run_list); + + if (list_empty(array->queue + rt_se_prio(rt_se))) + __clear_bit(rt_se_prio(rt_se), array->bitmap); + + rt_se->on_list = 0; +} + +static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags) { struct rt_rq *rt_rq = rt_rq_of_se(rt_se); struct rt_prio_array *array = &rt_rq->active; @@ -1179,26 +1210,37 @@ static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, bool head) * get throttled and the current group doesn't have any other * active members. */ - if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running)) + if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running)) { + if (rt_se->on_list) + __delist_rt_entity(rt_se, array); return; + } - if (head) - list_add(&rt_se->run_list, queue); - else - list_add_tail(&rt_se->run_list, queue); - __set_bit(rt_se_prio(rt_se), array->bitmap); + if (move_entity(flags)) { + WARN_ON_ONCE(rt_se->on_list); + if (flags & ENQUEUE_HEAD) + list_add(&rt_se->run_list, queue); + else + list_add_tail(&rt_se->run_list, queue); + + __set_bit(rt_se_prio(rt_se), array->bitmap); + rt_se->on_list = 1; + } + rt_se->on_rq = 1; inc_rt_tasks(rt_se, rt_rq); } -static void __dequeue_rt_entity(struct sched_rt_entity *rt_se) +static void __dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags) { struct rt_rq *rt_rq = rt_rq_of_se(rt_se); struct rt_prio_array *array = &rt_rq->active; - list_del_init(&rt_se->run_list); - if (list_empty(array->queue + rt_se_prio(rt_se))) - __clear_bit(rt_se_prio(rt_se), array->bitmap); + if (move_entity(flags)) { + WARN_ON_ONCE(!rt_se->on_list); + __delist_rt_entity(rt_se, array); + } + rt_se->on_rq = 0; dec_rt_tasks(rt_se, rt_rq); } @@ -1207,7 +1249,7 @@ static void __dequeue_rt_entity(struct sched_rt_entity *rt_se) * Because the prio of an upper entry depends on the lower * entries, we must remove entries top - down. */ -static void dequeue_rt_stack(struct sched_rt_entity *rt_se) +static void dequeue_rt_stack(struct sched_rt_entity *rt_se, unsigned int flags) { struct sched_rt_entity *back = NULL; @@ -1220,31 +1262,31 @@ static void dequeue_rt_stack(struct sched_rt_entity *rt_se) for (rt_se = back; rt_se; rt_se = rt_se->back) { if (on_rt_rq(rt_se)) - __dequeue_rt_entity(rt_se); + __dequeue_rt_entity(rt_se, flags); } } -static void enqueue_rt_entity(struct sched_rt_entity *rt_se, bool head) +static void enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags) { struct rq *rq = rq_of_rt_se(rt_se); - dequeue_rt_stack(rt_se); + dequeue_rt_stack(rt_se, flags); for_each_sched_rt_entity(rt_se) - __enqueue_rt_entity(rt_se, head); + __enqueue_rt_entity(rt_se, flags); enqueue_top_rt_rq(&rq->rt); } -static void dequeue_rt_entity(struct sched_rt_entity *rt_se) +static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags) { struct rq *rq = rq_of_rt_se(rt_se); - dequeue_rt_stack(rt_se); + dequeue_rt_stack(rt_se, flags); for_each_sched_rt_entity(rt_se) { struct rt_rq *rt_rq = group_rt_rq(rt_se); if (rt_rq && rt_rq->rt_nr_running) - __enqueue_rt_entity(rt_se, false); + __enqueue_rt_entity(rt_se, flags); } enqueue_top_rt_rq(&rq->rt); } @@ -1260,7 +1302,7 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags) if (flags & ENQUEUE_WAKEUP) rt_se->timeout = 0; - enqueue_rt_entity(rt_se, flags & ENQUEUE_HEAD); + enqueue_rt_entity(rt_se, flags); if (!task_current(rq, p) && p->nr_cpus_allowed > 1) enqueue_pushable_task(rq, p); @@ -1271,7 +1313,7 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags) struct sched_rt_entity *rt_se = &p->rt; update_curr_rt(rq); - dequeue_rt_entity(rt_se); + dequeue_rt_entity(rt_se, flags); dequeue_pushable_task(rq, p); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 10f1637..ef5875f 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -3,6 +3,7 @@ #include #include #include +#include #include #include #include @@ -313,12 +314,11 @@ extern int tg_nop(struct task_group *tg, void *data); extern void free_fair_sched_group(struct task_group *tg); extern int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent); -extern void unregister_fair_sched_group(struct task_group *tg, int cpu); +extern void unregister_fair_sched_group(struct task_group *tg); extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq, struct sched_entity *se, int cpu, struct sched_entity *parent); extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b); -extern int sched_group_set_shares(struct task_group *tg, unsigned long shares); extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b); extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b); @@ -909,6 +909,18 @@ static inline unsigned int group_first_cpu(struct sched_group *group) extern int group_balance_cpu(struct sched_group *sg); +#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL) +void register_sched_domain_sysctl(void); +void unregister_sched_domain_sysctl(void); +#else +static inline void register_sched_domain_sysctl(void) +{ +} +static inline void unregister_sched_domain_sysctl(void) +{ +} +#endif + #else static inline void sched_ttwu_pending(void) { } @@ -1022,6 +1034,7 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR]; #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */ extern struct static_key_false sched_numa_balancing; +extern struct static_key_false sched_schedstats; static inline u64 global_rt_period(void) { @@ -1130,18 +1143,40 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev) extern const int sched_prio_to_weight[40]; extern const u32 sched_prio_to_wmult[40]; +/* + * {de,en}queue flags: + * + * DEQUEUE_SLEEP - task is no longer runnable + * ENQUEUE_WAKEUP - task just became runnable + * + * SAVE/RESTORE - an otherwise spurious dequeue/enqueue, done to ensure tasks + * are in a known state which allows modification. Such pairs + * should preserve as much state as possible. + * + * MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location + * in the runqueue. + * + * ENQUEUE_HEAD - place at front of runqueue (tail if not specified) + * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline) + * ENQUEUE_WAKING - sched_class::task_waking was called + * + */ + +#define DEQUEUE_SLEEP 0x01 +#define DEQUEUE_SAVE 0x02 /* matches ENQUEUE_RESTORE */ +#define DEQUEUE_MOVE 0x04 /* matches ENQUEUE_MOVE */ + #define ENQUEUE_WAKEUP 0x01 -#define ENQUEUE_HEAD 0x02 +#define ENQUEUE_RESTORE 0x02 +#define ENQUEUE_MOVE 0x04 + +#define ENQUEUE_HEAD 0x08 +#define ENQUEUE_REPLENISH 0x10 #ifdef CONFIG_SMP -#define ENQUEUE_WAKING 0x04 /* sched_class::task_waking was called */ +#define ENQUEUE_WAKING 0x20 #else #define ENQUEUE_WAKING 0x00 #endif -#define ENQUEUE_REPLENISH 0x08 -#define ENQUEUE_RESTORE 0x10 - -#define DEQUEUE_SLEEP 0x01 -#define DEQUEUE_SAVE 0x02 #define RETRY_TASK ((void *)-1UL) diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index b0fbc76..70b3b6a 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -29,9 +29,10 @@ rq_sched_info_dequeued(struct rq *rq, unsigned long long delta) if (rq) rq->rq_sched_info.run_delay += delta; } -# define schedstat_inc(rq, field) do { (rq)->field++; } while (0) -# define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0) -# define schedstat_set(var, val) do { var = (val); } while (0) +# define schedstat_enabled() static_branch_unlikely(&sched_schedstats) +# define schedstat_inc(rq, field) do { if (schedstat_enabled()) { (rq)->field++; } } while (0) +# define schedstat_add(rq, field, amt) do { if (schedstat_enabled()) { (rq)->field += (amt); } } while (0) +# define schedstat_set(var, val) do { if (schedstat_enabled()) { var = (val); } } while (0) #else /* !CONFIG_SCHEDSTATS */ static inline void rq_sched_info_arrive(struct rq *rq, unsigned long long delta) @@ -42,6 +43,7 @@ rq_sched_info_dequeued(struct rq *rq, unsigned long long delta) static inline void rq_sched_info_depart(struct rq *rq, unsigned long long delta) {} +# define schedstat_enabled() 0 # define schedstat_inc(rq, field) do { } while (0) # define schedstat_add(rq, field, amt) do { } while (0) # define schedstat_set(var, val) do { } while (0) diff --git b/kernel/sched/swait.c b/kernel/sched/swait.c new file mode 100644 index 0000000..82f0dff --- /dev/null +++ b/kernel/sched/swait.c @@ -0,0 +1,123 @@ +#include +#include + +void __init_swait_queue_head(struct swait_queue_head *q, const char *name, + struct lock_class_key *key) +{ + raw_spin_lock_init(&q->lock); + lockdep_set_class_and_name(&q->lock, key, name); + INIT_LIST_HEAD(&q->task_list); +} +EXPORT_SYMBOL(__init_swait_queue_head); + +/* + * The thing about the wake_up_state() return value; I think we can ignore it. + * + * If for some reason it would return 0, that means the previously waiting + * task is already running, so it will observe condition true (or has already). + */ +void swake_up_locked(struct swait_queue_head *q) +{ + struct swait_queue *curr; + + if (list_empty(&q->task_list)) + return; + + curr = list_first_entry(&q->task_list, typeof(*curr), task_list); + wake_up_process(curr->task); + list_del_init(&curr->task_list); +} +EXPORT_SYMBOL(swake_up_locked); + +void swake_up(struct swait_queue_head *q) +{ + unsigned long flags; + + if (!swait_active(q)) + return; + + raw_spin_lock_irqsave(&q->lock, flags); + swake_up_locked(q); + raw_spin_unlock_irqrestore(&q->lock, flags); +} +EXPORT_SYMBOL(swake_up); + +/* + * Does not allow usage from IRQ disabled, since we must be able to + * release IRQs to guarantee bounded hold time. + */ +void swake_up_all(struct swait_queue_head *q) +{ + struct swait_queue *curr; + LIST_HEAD(tmp); + + if (!swait_active(q)) + return; + + raw_spin_lock_irq(&q->lock); + list_splice_init(&q->task_list, &tmp); + while (!list_empty(&tmp)) { + curr = list_first_entry(&tmp, typeof(*curr), task_list); + + wake_up_state(curr->task, TASK_NORMAL); + list_del_init(&curr->task_list); + + if (list_empty(&tmp)) + break; + + raw_spin_unlock_irq(&q->lock); + raw_spin_lock_irq(&q->lock); + } + raw_spin_unlock_irq(&q->lock); +} +EXPORT_SYMBOL(swake_up_all); + +void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait) +{ + wait->task = current; + if (list_empty(&wait->task_list)) + list_add(&wait->task_list, &q->task_list); +} + +void prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait, int state) +{ + unsigned long flags; + + raw_spin_lock_irqsave(&q->lock, flags); + __prepare_to_swait(q, wait); + set_current_state(state); + raw_spin_unlock_irqrestore(&q->lock, flags); +} +EXPORT_SYMBOL(prepare_to_swait); + +long prepare_to_swait_event(struct swait_queue_head *q, struct swait_queue *wait, int state) +{ + if (signal_pending_state(state, current)) + return -ERESTARTSYS; + + prepare_to_swait(q, wait, state); + + return 0; +} +EXPORT_SYMBOL(prepare_to_swait_event); + +void __finish_swait(struct swait_queue_head *q, struct swait_queue *wait) +{ + __set_current_state(TASK_RUNNING); + if (!list_empty(&wait->task_list)) + list_del_init(&wait->task_list); +} + +void finish_swait(struct swait_queue_head *q, struct swait_queue *wait) +{ + unsigned long flags; + + __set_current_state(TASK_RUNNING); + + if (!list_empty_careful(&wait->task_list)) { + raw_spin_lock_irqsave(&q->lock, flags); + list_del_init(&wait->task_list); + raw_spin_unlock_irqrestore(&q->lock, flags); + } +} +EXPORT_SYMBOL(finish_swait); diff --git a/kernel/smpboot.c b/kernel/smpboot.c index d264f59..28c8e73 100644 --- a/kernel/smpboot.c +++ b/kernel/smpboot.c @@ -174,7 +174,7 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu) if (tsk) return 0; - td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu)); + td = kzalloc_node(sizeof(*td), GFP_KERNEL | ___GFP_TOI_NOTRACK, cpu_to_node(cpu)); if (!td) return -ENOMEM; td->cpu = cpu; diff --git a/kernel/softirq.c b/kernel/softirq.c index 479e443..8aae49d 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -116,9 +116,9 @@ void __local_bh_disable_ip(unsigned long ip, unsigned int cnt) if (preempt_count() == cnt) { #ifdef CONFIG_DEBUG_PREEMPT - current->preempt_disable_ip = get_parent_ip(CALLER_ADDR1); + current->preempt_disable_ip = get_lock_parent_ip(); #endif - trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1)); + trace_preempt_off(CALLER_ADDR0, get_lock_parent_ip()); } } EXPORT_SYMBOL(__local_bh_disable_ip); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 97715fd..f5102fa 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -350,6 +350,17 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, +#ifdef CONFIG_SCHEDSTATS + { + .procname = "sched_schedstats", + .data = NULL, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = sysctl_schedstats, + .extra1 = &zero, + .extra2 = &one, + }, +#endif /* CONFIG_SCHEDSTATS */ #endif /* CONFIG_SMP */ #ifdef CONFIG_NUMA_BALANCING { @@ -505,7 +516,7 @@ static struct ctl_table kern_table[] = { .data = &latencytop_enabled, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = sysctl_latencytop, }, #endif #ifdef CONFIG_BLK_DEV_INITRD diff --git a/kernel/tsacct.c b/kernel/tsacct.c index 975cb49..f8e26ab 100644 --- a/kernel/tsacct.c +++ b/kernel/tsacct.c @@ -93,9 +93,11 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p) { struct mm_struct *mm; - /* convert pages-usec to Mbyte-usec */ - stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / MB; - stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB; + /* convert pages-nsec/1024 to Mbyte-usec, see __acct_update_integrals */ + stats->coremem = p->acct_rss_mem1 * PAGE_SIZE; + do_div(stats->coremem, 1000 * KB); + stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE; + do_div(stats->virtmem, 1000 * KB); mm = get_task_mm(p); if (mm) { /* adjust to KB unit */ @@ -123,27 +125,28 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p) static void __acct_update_integrals(struct task_struct *tsk, cputime_t utime, cputime_t stime) { - if (likely(tsk->mm)) { - cputime_t time, dtime; - struct timeval value; - unsigned long flags; - u64 delta; - - local_irq_save(flags); - time = stime + utime; - dtime = time - tsk->acct_timexpd; - jiffies_to_timeval(cputime_to_jiffies(dtime), &value); - delta = value.tv_sec; - delta = delta * USEC_PER_SEC + value.tv_usec; - - if (delta == 0) - goto out; - tsk->acct_timexpd = time; - tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm); - tsk->acct_vm_mem1 += delta * tsk->mm->total_vm; - out: - local_irq_restore(flags); - } + cputime_t time, dtime; + u64 delta; + + if (!likely(tsk->mm)) + return; + + time = stime + utime; + dtime = time - tsk->acct_timexpd; + /* Avoid division: cputime_t is often in nanoseconds already. */ + delta = cputime_to_nsecs(dtime); + + if (delta < TICK_NSEC) + return; + + tsk->acct_timexpd = time; + /* + * Divide by 1024 to avoid overflow, and to avoid division. + * The final unit reported to userspace is Mbyte-usecs, + * the rest of the math is done in xacct_add_tsk. + */ + tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10; + tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10; } /** @@ -153,9 +156,12 @@ static void __acct_update_integrals(struct task_struct *tsk, void acct_update_integrals(struct task_struct *tsk) { cputime_t utime, stime; + unsigned long flags; + local_irq_save(flags); task_cputime(tsk, &utime, &stime); __acct_update_integrals(tsk, utime, stime); + local_irq_restore(flags); } /** diff --git a/mm/debug.c b/mm/debug.c index f05b2d5..5c6da0f 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -40,6 +40,12 @@ static const struct trace_print_flags pageflag_names[] = { #ifdef CONFIG_MEMORY_FAILURE {1UL << PG_hwpoison, "hwpoison" }, #endif +#ifdef CONFIG_TOI_INCREMENTAL + {1UL << PG_toi_untracked, "toi_untracked" }, + {1UL << PG_toi_ro, "toi_ro" }, + {1UL << PG_toi_cbw, "toi_cbw" }, + {1UL << PG_toi_dirty, "toi_dirty" }, +#endif #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT) {1UL << PG_young, "young" }, {1UL << PG_idle, "idle" }, diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 6fe7d15..7d57843 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -70,7 +70,11 @@ static long ratelimit_pages = 32; /* * Start background writeback (via writeback threads) at this percentage */ +#ifdef CONFIG_PCK_INTERACTIVE +int dirty_background_ratio = 20; +#else int dirty_background_ratio = 10; +#endif /* * dirty_background_bytes starts at 0 (disabled) so that it is a function of @@ -87,7 +91,11 @@ int vm_highmem_is_dirtyable; /* * The generator of dirty data starts writeback at this percentage */ +#ifdef CONFIG_PCK_INTERACTIVE +int vm_dirty_ratio = 50; +#else int vm_dirty_ratio = 20; +#endif /* * vm_dirty_bytes starts at 0 (disabled) so that it is a function of diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 838ca8b..58bc482 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -62,6 +62,7 @@ #include #include #include +#include #include #include @@ -751,6 +752,12 @@ static inline int free_pages_check(struct page *page) if (unlikely(page->mem_cgroup)) bad_reason = "page still charged to cgroup"; #endif + if (unlikely(PageTOI_Untracked(page))) { + // Make it writable and included in image if allocated. + ClearPageTOI_Untracked(page); + // If it gets allocated, it will be dirty from TOI's POV. + SetPageTOI_Dirty(page); + } if (unlikely(bad_reason)) { bad_page(page, bad_reason, bad_flags); return 1; @@ -1390,6 +1397,11 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags, struct page *p = page + i; if (unlikely(check_new_page(p))) return 1; + if (unlikely(toi_incremental_support() && gfp_flags & ___GFP_TOI_NOTRACK)) { + // Make the page writable if it's protected, and set it to be untracked. + SetPageTOI_Untracked(p); + toi_make_writable(init_mm.pgd, (unsigned long) page_address(p)); + } } set_page_private(page, 0); diff --git a/mm/percpu.c b/mm/percpu.c index 998607a..2f040d0 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -125,6 +125,7 @@ static int pcpu_nr_units __read_mostly; static int pcpu_atom_size __read_mostly; static int pcpu_nr_slots __read_mostly; static size_t pcpu_chunk_struct_size __read_mostly; +static int pcpu_pfns; /* cpus with the lowest and highest unit addresses */ static unsigned int pcpu_low_unit_cpu __read_mostly; @@ -1790,6 +1791,7 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info( /* calculate size_sum and ensure dyn_size is enough for early alloc */ size_sum = PFN_ALIGN(static_size + reserved_size + max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE)); + pcpu_pfns = PFN_DOWN(size_sum); dyn_size = size_sum - static_size - reserved_size; /* @@ -2277,6 +2279,22 @@ void __init percpu_init_late(void) } } +#ifdef CONFIG_TOI_INCREMENTAL +/* + * It doesn't matter if we mark an extra page as untracked (and therefore + * always save it in incremental images). + */ +void toi_mark_per_cpus_pages_untracked(void) +{ + int i; + + struct page *page = virt_to_page(pcpu_base_addr); + + for (i = 0; i < pcpu_pfns; i++) + SetPageTOI_Untracked(page + i); +} +#endif + /* * Percpu allocator is initialized early during boot when neither slab or * workqueue is available. Plug async management until everything is up diff --git a/mm/shmem.c b/mm/shmem.c index 440e2a7..f64ab5f 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1500,7 +1500,7 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) } static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir, - umode_t mode, dev_t dev, unsigned long flags) + umode_t mode, dev_t dev, unsigned long flags, int atomic_copy) { struct inode *inode; struct shmem_inode_info *info; @@ -1521,6 +1521,8 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode spin_lock_init(&info->lock); info->seals = F_SEAL_SEAL; info->flags = flags & VM_NORESERVE; + if (atomic_copy) + inode->i_flags |= S_ATOMIC_COPY; INIT_LIST_HEAD(&info->swaplist); simple_xattrs_init(&info->xattrs); cache_no_acl(inode); @@ -2310,7 +2312,7 @@ shmem_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev) struct inode *inode; int error = -ENOSPC; - inode = shmem_get_inode(dir->i_sb, dir, mode, dev, VM_NORESERVE); + inode = shmem_get_inode(dir->i_sb, dir, mode, dev, VM_NORESERVE, 0); if (inode) { error = simple_acl_create(dir, inode); if (error) @@ -2339,7 +2341,7 @@ shmem_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) struct inode *inode; int error = -ENOSPC; - inode = shmem_get_inode(dir->i_sb, dir, mode, 0, VM_NORESERVE); + inode = shmem_get_inode(dir->i_sb, dir, mode, 0, VM_NORESERVE, 0); if (inode) { error = security_inode_init_security(inode, dir, NULL, @@ -2531,7 +2533,7 @@ static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *s if (len > PAGE_CACHE_SIZE) return -ENAMETOOLONG; - inode = shmem_get_inode(dir->i_sb, dir, S_IFLNK|S_IRWXUGO, 0, VM_NORESERVE); + inode = shmem_get_inode(dir->i_sb, dir, S_IFLNK|S_IRWXUGO, 0, VM_NORESERVE, 0); if (!inode) return -ENOSPC; @@ -3010,7 +3012,7 @@ SYSCALL_DEFINE2(memfd_create, goto err_name; } - file = shmem_file_setup(name, 0, VM_NORESERVE); + file = shmem_file_setup(name, 0, VM_NORESERVE, 0); if (IS_ERR(file)) { error = PTR_ERR(file); goto err_fd; @@ -3101,7 +3103,7 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent) sb->s_flags |= MS_POSIXACL; #endif - inode = shmem_get_inode(sb, NULL, S_IFDIR | sbinfo->mode, 0, VM_NORESERVE); + inode = shmem_get_inode(sb, NULL, S_IFDIR | sbinfo->mode, 0, VM_NORESERVE, 0); if (!inode) goto failed; inode->i_uid = sbinfo->uid; @@ -3357,7 +3359,7 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range); #define shmem_vm_ops generic_file_vm_ops #define shmem_file_operations ramfs_file_operations -#define shmem_get_inode(sb, dir, mode, dev, flags) ramfs_get_inode(sb, dir, mode, dev) +#define shmem_get_inode(sb, dir, mode, dev, flags, atomic_copy) ramfs_get_inode(sb, dir, mode, dev) #define shmem_acct_size(flags, size) 0 #define shmem_unacct_size(flags, size) do {} while (0) @@ -3370,7 +3372,8 @@ static struct dentry_operations anon_ops = { }; static struct file *__shmem_file_setup(const char *name, loff_t size, - unsigned long flags, unsigned int i_flags) + unsigned long flags, unsigned int i_flags, + int atomic_copy) { struct file *res; struct inode *inode; @@ -3399,7 +3402,7 @@ static struct file *__shmem_file_setup(const char *name, loff_t size, d_set_d_op(path.dentry, &anon_ops); res = ERR_PTR(-ENOSPC); - inode = shmem_get_inode(sb, NULL, S_IFREG | S_IRWXUGO, 0, flags); + inode = shmem_get_inode(sb, NULL, S_IFREG | S_IRWXUGO, 0, flags, atomic_copy); if (!inode) goto put_memory; @@ -3435,9 +3438,9 @@ put_path: * @size: size to be set for the file * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size */ -struct file *shmem_kernel_file_setup(const char *name, loff_t size, unsigned long flags) +struct file *shmem_kernel_file_setup(const char *name, loff_t size, unsigned long flags, int atomic_copy) { - return __shmem_file_setup(name, size, flags, S_PRIVATE); + return __shmem_file_setup(name, size, flags, S_PRIVATE, atomic_copy); } /** @@ -3446,9 +3449,9 @@ struct file *shmem_kernel_file_setup(const char *name, loff_t size, unsigned lon * @size: size to be set for the file * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size */ -struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags) +struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags, int atomic_copy) { - return __shmem_file_setup(name, size, flags, 0); + return __shmem_file_setup(name, size, flags, 0, atomic_copy); } EXPORT_SYMBOL_GPL(shmem_file_setup); @@ -3467,7 +3470,7 @@ int shmem_zero_setup(struct vm_area_struct *vma) * accessible to the user through its mapping, use S_PRIVATE flag to * bypass file security, in the same way as shmem_kernel_file_setup(). */ - file = __shmem_file_setup("dev/zero", size, vma->vm_flags, S_PRIVATE); + file = __shmem_file_setup("dev/zero", size, vma->vm_flags, S_PRIVATE, 0); if (IS_ERR(file)) return PTR_ERR(file); diff --git a/mm/slub.c b/mm/slub.c index d8fbd4a..5328e64 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1379,7 +1379,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s, struct page *page; int order = oo_order(oo); - flags |= __GFP_NOTRACK; + flags |= (__GFP_NOTRACK | ___GFP_TOI_NOTRACK); if (node == NUMA_NO_NODE) page = alloc_pages(flags, order); @@ -3543,7 +3543,7 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node) struct page *page; void *ptr = NULL; - flags |= __GFP_COMP | __GFP_NOTRACK; + flags |= __GFP_COMP | __GFP_NOTRACK | __GFP_TOI_NOTRACK; page = alloc_kmem_pages_node(node, flags, get_order(size)); if (page) ptr = page_address(page); diff --git a/mm/swapfile.c b/mm/swapfile.c index d2c3736..98d3483 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -43,7 +44,6 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); static void free_swap_count_continuations(struct swap_info_struct *); -static sector_t map_swap_entry(swp_entry_t, struct block_device**); DEFINE_SPINLOCK(swap_lock); static unsigned int nr_swapfiles; @@ -719,6 +719,60 @@ swp_entry_t get_swap_page_of_type(int type) return (swp_entry_t) {0}; } +static unsigned int find_next_to_unuse(struct swap_info_struct *si, + unsigned int prev, bool frontswap); + +void get_swap_range_of_type(int type, swp_entry_t *start, swp_entry_t *end, + unsigned int limit) +{ + struct swap_info_struct *si; + pgoff_t start_at; + unsigned int i; + + *start = swp_entry(0, 0); + *end = swp_entry(0, 0); + si = swap_info[type]; + spin_lock(&si->lock); + if (si && (si->flags & SWP_WRITEOK)) { + atomic_long_dec(&nr_swap_pages); + /* This is called for allocating swap entry, not cache */ + start_at = scan_swap_map(si, 1); + if (start_at) { + unsigned long stop_at = find_next_to_unuse(si, start_at, 0); + if (stop_at > start_at) + stop_at--; + else + stop_at = si->max - 1; + if (stop_at - start_at + 1 > limit) + stop_at = min_t(unsigned int, + start_at + limit - 1, + si->max - 1); + /* Mark them used */ + for (i = start_at; i <= stop_at; i++) + si->swap_map[i] = 1; + /* first page already done above */ + si->inuse_pages += stop_at - start_at; + + atomic_long_sub(stop_at - start_at, &nr_swap_pages); + if (start_at == si->lowest_bit) + si->lowest_bit = stop_at + 1; + if (stop_at == si->highest_bit) + si->highest_bit = start_at - 1; + if (si->inuse_pages == si->pages) { + si->lowest_bit = si->max; + si->highest_bit = 0; + } + for (i = start_at + 1; i <= stop_at; i++) + inc_cluster_info_page(si, si->cluster_info, i); + si->cluster_next = stop_at + 1; + *start = swp_entry(type, start_at); + *end = swp_entry(type, stop_at); + } else + atomic_long_inc(&nr_swap_pages); + } + spin_unlock(&si->lock); +} + static struct swap_info_struct *swap_info_get(swp_entry_t entry) { struct swap_info_struct *p; @@ -1607,7 +1661,7 @@ static void drain_mmlist(void) * Note that the type of this function is sector_t, but it returns page offset * into the bdev, not sector offset. */ -static sector_t map_swap_entry(swp_entry_t entry, struct block_device **bdev) +sector_t map_swap_entry(swp_entry_t entry, struct block_device **bdev) { struct swap_info_struct *sis; struct swap_extent *start_se; @@ -2738,8 +2792,14 @@ pgoff_t __page_file_index(struct page *page) VM_BUG_ON_PAGE(!PageSwapCache(page), page); return swp_offset(swap); } + EXPORT_SYMBOL_GPL(__page_file_index); +struct swap_info_struct *get_swap_info_struct(unsigned type) +{ + return swap_info[type]; +} + /* * add_swap_count_continuation - called when a swap count is duplicated * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's diff --git a/mm/vmscan.c b/mm/vmscan.c index 71b1c29..f406a47 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1475,7 +1475,7 @@ static int too_many_isolated(struct zone *zone, int file, { unsigned long inactive, isolated; - if (current_is_kswapd()) + if (current_is_kswapd() || sc->hibernation_mode) return 0; if (!sane_reclaim(sc)) @@ -2345,6 +2345,9 @@ static inline bool should_continue_reclaim(struct zone *zone, unsigned long pages_for_compaction; unsigned long inactive_lru_pages; + if (nr_reclaimed && nr_scanned && sc->nr_to_reclaim >= sc->nr_reclaimed) + return true; + /* If not in reclaim/compaction mode, stop */ if (!in_reclaim_compaction(sc)) return false; @@ -2659,6 +2662,12 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, unsigned long total_scanned = 0; unsigned long writeback_threshold; bool zones_reclaimable; + +#ifdef CONFIG_FREEZER + if (unlikely(pm_freezing && !sc->hibernation_mode)) + return 0; +#endif + retry: delayacct_freepages_start(); @@ -3540,6 +3549,11 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx) if (!populated_zone(zone)) return; +#ifdef CONFIG_FREEZER + if (pm_freezing) + return; +#endif + if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL)) return; pgdat = zone->zone_pgdat; @@ -3565,7 +3579,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx) * LRU order by reclaiming preferentially * inactive > active > active referenced > active mapped */ -unsigned long shrink_all_memory(unsigned long nr_to_reclaim) +unsigned long shrink_memory_mask(unsigned long nr_to_reclaim, gfp_t mask) { struct reclaim_state reclaim_state; struct scan_control sc = { @@ -3594,6 +3608,11 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim) return nr_reclaimed; } + +unsigned long shrink_all_memory(unsigned long nr_to_reclaim) +{ + return shrink_memory_mask(nr_to_reclaim, GFP_HIGHUSER_MOVABLE); +} #endif /* CONFIG_HIBERNATION */ /* It's optimal to keep kswapds on the same CPUs as their memory, but diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c index fd3ce46..2d49cee 100644 --- a/net/core/secure_seq.c +++ b/net/core/secure_seq.c @@ -8,7 +8,11 @@ #include #include #include +#include +#include +#include +#include #include #if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_INET) @@ -39,6 +43,102 @@ static u32 seq_scale(u32 seq) } #endif +#ifdef CONFIG_TCP_STEALTH +u32 tcp_stealth_sequence_number(struct sock *sk, __be32 *daddr, + u32 daddr_size, __be16 dport) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct tcp_md5sig_key *md5; + + __u32 sec[MD5_MESSAGE_BYTES / sizeof(__u32)]; + __u32 i; + __u32 tsval = 0; + + __be32 iv[MD5_DIGEST_WORDS] = { 0 }; + __be32 isn; + + memcpy(iv, daddr, (daddr_size > sizeof(iv)) ? sizeof(iv) : daddr_size); + +#ifdef CONFIG_TCP_MD5SIG + md5 = tp->af_specific->md5_lookup(sk, sk); +#else + md5 = NULL; +#endif + if (likely(sysctl_tcp_timestamps && !md5) || tp->stealth.saw_tsval) + tsval = tp->stealth.mstamp.stamp_jiffies; + + ((__be16 *)iv)[2] ^= cpu_to_be16(tp->stealth.integrity_hash); + iv[2] ^= cpu_to_be32(tsval); + ((__be16 *)iv)[6] ^= dport; + + for (i = 0; i < MD5_DIGEST_WORDS; i++) + iv[i] = le32_to_cpu(iv[i]); + for (i = 0; i < MD5_MESSAGE_BYTES / sizeof(__le32); i++) + sec[i] = le32_to_cpu(((__le32 *)tp->stealth.secret)[i]); + + md5_transform(iv, sec); + + isn = cpu_to_be32(iv[0]) ^ cpu_to_be32(iv[1]) ^ + cpu_to_be32(iv[2]) ^ cpu_to_be32(iv[3]); + + if (tp->stealth.mode & TCP_STEALTH_MODE_INTEGRITY) + be32_isn_to_be16_ih(isn) = + cpu_to_be16(tp->stealth.integrity_hash); + + return be32_to_cpu(isn); +} +EXPORT_SYMBOL(tcp_stealth_sequence_number); + +u32 tcp_stealth_do_auth(struct sock *sk, struct sk_buff *skb) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct tcphdr *th = tcp_hdr(skb); + __be32 isn = th->seq; + __be32 hash; + __be32 *daddr; + u32 daddr_size; + + tp->stealth.saw_tsval = + tcp_parse_tsval_option(&tp->stealth.mstamp.stamp_jiffies, th); + + if (tp->stealth.mode & TCP_STEALTH_MODE_INTEGRITY_LEN) + tp->stealth.integrity_hash = + be16_to_cpu(be32_isn_to_be16_ih(isn)); + + switch (tp->inet_conn.icsk_inet.sk.sk_family) { +#if IS_ENABLED(CONFIG_IPV6) + case PF_INET6: + daddr_size = sizeof(ipv6_hdr(skb)->daddr.s6_addr32); + daddr = ipv6_hdr(skb)->daddr.s6_addr32; + break; +#endif + case PF_INET: + daddr_size = sizeof(ip_hdr(skb)->daddr); + daddr = &ip_hdr(skb)->daddr; + break; + default: + pr_err("TCP Stealth: Unknown network layer protocol, stop!\n"); + return 1; + } + + hash = tcp_stealth_sequence_number(sk, daddr, daddr_size, th->dest); + cpu_to_be32s(&hash); + + if (tp->stealth.mode & TCP_STEALTH_MODE_AUTH && + tp->stealth.mode & TCP_STEALTH_MODE_INTEGRITY_LEN && + be32_isn_to_be16_av(isn) == be32_isn_to_be16_av(hash)) + return 0; + + if (tp->stealth.mode & TCP_STEALTH_MODE_AUTH && + !(tp->stealth.mode & TCP_STEALTH_MODE_INTEGRITY_LEN) && + isn == hash) + return 0; + + return 1; +} +EXPORT_SYMBOL(tcp_stealth_do_auth); +#endif + #if IS_ENABLED(CONFIG_IPV6) __u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr, __be16 sport, __be16 dport) diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig index 7758247..7ccd693 100644 --- a/net/ipv4/Kconfig +++ b/net/ipv4/Kconfig @@ -653,6 +653,9 @@ choice config DEFAULT_VEGAS bool "Vegas" if TCP_CONG_VEGAS=y + config DEFAULT_YEAH + bool "YeAH" if TCP_CONG_YEAH=y + config DEFAULT_VENO bool "Veno" if TCP_CONG_VENO=y @@ -683,6 +686,7 @@ config DEFAULT_TCP_CONG default "htcp" if DEFAULT_HTCP default "hybla" if DEFAULT_HYBLA default "vegas" if DEFAULT_VEGAS + default "yeah" if DEFAULT_YEAH default "westwood" if DEFAULT_WESTWOOD default "veno" if DEFAULT_VENO default "reno" if DEFAULT_RENO @@ -700,3 +704,13 @@ config TCP_MD5SIG on the Internet. If unsure, say N. + +config TCP_STEALTH + bool "TCP: Stealth TCP socket support" + default n + ---help--- + This option enables support for stealth TCP sockets. If you do not + know what this means, you do not need it. + + If unsure, say N. + diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 483ffdf..0da50fe 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -269,6 +269,7 @@ #include #include #include +#include #include #include @@ -2318,6 +2319,43 @@ static int tcp_repair_options_est(struct tcp_sock *tp, return 0; } +#ifdef CONFIG_TCP_STEALTH +int tcp_stealth_integrity(__be16 *hash, u8 *secret, u8 *payload, int len) +{ + struct scatterlist sg[2]; + struct crypto_hash *tfm; + struct hash_desc desc; + __be16 h[MD5_DIGEST_WORDS * 2]; + int i; + int err = 0; + + tfm = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC); + if (IS_ERR(tfm)) { + err = -PTR_ERR(tfm); + goto out; + } + desc.tfm = tfm; + desc.flags = 0; + + sg_init_table(sg, 2); + sg_set_buf(&sg[0], secret, MD5_MESSAGE_BYTES); + sg_set_buf(&sg[1], payload, len); + + if (crypto_hash_digest(&desc, sg, MD5_MESSAGE_BYTES + len, (u8 *)h)) { + err = -EFAULT; + goto out; + } + + *hash = be16_to_cpu(h[0]); + for (i = 1; i < MD5_DIGEST_WORDS * 2; i++) + *hash ^= be16_to_cpu(h[i]); + +out: + crypto_free_hash(tfm); + return err; +} +#endif + /* * Socket option code for TCP. */ @@ -2348,6 +2386,66 @@ static int do_tcp_setsockopt(struct sock *sk, int level, release_sock(sk); return err; } +#ifdef CONFIG_TCP_STEALTH + case TCP_STEALTH: { + u8 secret[MD5_MESSAGE_BYTES] = { 0 }; + + val = copy_from_user(secret, optval, + min_t(unsigned int, optlen, + MD5_MESSAGE_BYTES)); + if (val != 0) + return -EFAULT; + + lock_sock(sk); + memcpy(tp->stealth.secret, secret, MD5_MESSAGE_BYTES); + tp->stealth.mode = TCP_STEALTH_MODE_AUTH; + tp->stealth.mstamp.v64 = 0; + tp->stealth.saw_tsval = false; + release_sock(sk); + return err; + } + case TCP_STEALTH_INTEGRITY: { + u8 *payload; + + lock_sock(sk); + + if (!(tp->stealth.mode & TCP_STEALTH_MODE_AUTH)) { + err = -EOPNOTSUPP; + goto stealth_integrity_out_1; + } + + if (optlen < 1 || optlen > USHRT_MAX) { + err = -EINVAL; + goto stealth_integrity_out_1; + } + + payload = vmalloc(optlen); + if (!payload) { + err = -ENOMEM; + goto stealth_integrity_out_1; + } + + val = copy_from_user(payload, optval, optlen); + if (val != 0) { + err = -EFAULT; + goto stealth_integrity_out_2; + } + + err = tcp_stealth_integrity(&tp->stealth.integrity_hash, + tp->stealth.secret, payload, + optlen); + if (err) + goto stealth_integrity_out_2; + + tp->stealth.mode |= TCP_STEALTH_MODE_INTEGRITY; + +stealth_integrity_out_2: + vfree(payload); +stealth_integrity_out_1: + release_sock(sk); + return err; + } +#endif default: /* fallthru */ break; @@ -2599,6 +2697,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level, tp->notsent_lowat = val; sk->sk_write_space(sk); break; +#ifdef CONFIG_TCP_STEALTH + case TCP_STEALTH_INTEGRITY_LEN: + if (!(tp->stealth.mode & TCP_STEALTH_MODE_AUTH)) { + err = -EOPNOTSUPP; + } else if (val < 1 || val > USHRT_MAX) { + err = -EINVAL; + } else { + tp->stealth.integrity_len = val; + tp->stealth.mode |= TCP_STEALTH_MODE_INTEGRITY_LEN; + } + break; +#endif default: err = -ENOPROTOOPT; break; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3b2c8e9..b3a47bc 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -77,6 +77,9 @@ #include int sysctl_tcp_timestamps __read_mostly = 1; +#ifdef CONFIG_TCP_STEALTH +EXPORT_SYMBOL(sysctl_tcp_timestamps); +#endif int sysctl_tcp_window_scaling __read_mostly = 1; int sysctl_tcp_sack __read_mostly = 1; int sysctl_tcp_fack __read_mostly = 1; @@ -3850,6 +3853,47 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb, return true; } +#ifdef CONFIG_TCP_STEALTH +/* Parse only the TSVal field of the TCP Timestamp option header. + */ +const bool tcp_parse_tsval_option(u32 *tsval, const struct tcphdr *th) +{ + int length = (th->doff << 2) - sizeof(*th); + const u8 *ptr = (const u8 *)(th + 1); + + /* If the TCP option is too short, we can short cut */ + if (length < TCPOLEN_TIMESTAMP) + return false; + + while (length > 0) { + int opcode = *ptr++; + int opsize; + + switch (opcode) { + case TCPOPT_EOL: + return false; + case TCPOPT_NOP: + length--; + continue; + case TCPOPT_TIMESTAMP: + opsize = *ptr++; + if (opsize != TCPOLEN_TIMESTAMP || opsize > length) + return false; + *tsval = get_unaligned_be32(ptr); + return true; + default: + opsize = *ptr++; + if (opsize < 2 || opsize > length) + return false; + } + ptr += opsize - 2; + length -= opsize; + } + return false; +} +EXPORT_SYMBOL(tcp_parse_tsval_option); +#endif + #ifdef CONFIG_TCP_MD5SIG /* * Parse MD5 Signature option @@ -4532,6 +4576,31 @@ err: } +#ifdef CONFIG_TCP_STEALTH +static int __tcp_stealth_integrity_check(struct sock *sk, struct sk_buff *skb) +{ + struct tcphdr *th = tcp_hdr(skb); + struct tcp_sock *tp = tcp_sk(sk); + u16 hash; + __be32 seq = cpu_to_be32(TCP_SKB_CB(skb)->seq - 1); + char *data = skb->data + th->doff * 4; + int len = skb->len - th->doff * 4; + + if (len < tp->stealth.integrity_len) + return 1; + + if (tcp_stealth_integrity(&hash, tp->stealth.secret, data, + tp->stealth.integrity_len)) + return 1; + + if (be32_isn_to_be16_ih(seq) != cpu_to_be16(hash)) + return 1; + + tp->stealth.mode &= ~TCP_STEALTH_MODE_INTEGRITY_LEN; + return 0; +} +#endif + static void tcp_data_queue(struct sock *sk, struct sk_buff *skb) { struct tcp_sock *tp = tcp_sk(sk); @@ -4541,6 +4610,14 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb) if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) goto drop; +#ifdef CONFIG_TCP_STEALTH + if (unlikely(tp->stealth.mode & TCP_STEALTH_MODE_INTEGRITY_LEN) && + __tcp_stealth_integrity_check(sk, skb)) { + tcp_reset(sk); + goto drop; + } +#endif + skb_dst_drop(skb); __skb_pull(skb, tcp_hdr(skb)->doff * 4); @@ -5313,6 +5390,15 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb, int eaten = 0; bool fragstolen = false; +#ifdef CONFIG_TCP_STEALTH + if (unlikely(tp->stealth.mode & + TCP_STEALTH_MODE_INTEGRITY_LEN) && + __tcp_stealth_integrity_check(sk, skb)) { + tcp_reset(sk); + goto discard; + } +#endif + if (tp->ucopy.task == current && tp->copied_seq == tp->rcv_nxt && len - tcp_header_len <= tp->ucopy.len && diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 487ac67..6dded95 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -74,6 +74,7 @@ #include #include #include +#include #include #include @@ -234,6 +235,21 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len) sk->sk_gso_type = SKB_GSO_TCPV4; sk_setup_caps(sk, &rt->dst); +#ifdef CONFIG_TCP_STEALTH + /* If CONFIG_TCP_STEALTH is defined, we need to know the timestamp as + * early as possible and thus move taking the snapshot of tcp_time_stamp + * here. + */ + skb_mstamp_get(&tp->stealth.mstamp); + + if (!tp->write_seq && likely(!tp->repair) && + unlikely(tp->stealth.mode & TCP_STEALTH_MODE_AUTH)) + tp->write_seq = tcp_stealth_sequence_number(sk, + &inet->inet_daddr, + sizeof(inet->inet_daddr), + usin->sin_port); +#endif + if (!tp->write_seq && likely(!tp->repair)) tp->write_seq = secure_tcp_sequence_number(inet->inet_saddr, inet->inet_daddr, @@ -1378,6 +1394,8 @@ static struct sock *tcp_v4_cookie_check(struct sock *sk, struct sk_buff *skb) */ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) { + struct tcp_sock *tp = tcp_sk(sk); + struct tcphdr *th = tcp_hdr(skb); struct sock *rsk; if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */ @@ -1399,6 +1417,15 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) if (tcp_checksum_complete(skb)) goto csum_err; +#ifdef CONFIG_TCP_STEALTH + if (sk->sk_state == TCP_LISTEN && th->syn && !th->fin && + unlikely(tp->stealth.mode & TCP_STEALTH_MODE_AUTH) && + tcp_stealth_do_auth(sk, skb)) { + rsk = sk; + goto reset; + } +#endif + if (sk->sk_state == TCP_LISTEN) { struct sock *nsk = tcp_v4_cookie_check(sk, skb); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index fda379c..20f2812 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -931,6 +931,13 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, tcb = TCP_SKB_CB(skb); memset(&opts, 0, sizeof(opts)); +#ifdef TCP_STEALTH + if (unlikely(tcb->tcp_flags & TCPHDR_SYN && + tp->stealth.mode & TCP_STEALTH_MODE_AUTH)) { + skb->skb_mstamp = tp->stealth.mstamp; + } +#endif + if (unlikely(tcb->tcp_flags & TCPHDR_SYN)) tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5); else @@ -3258,7 +3265,15 @@ int tcp_connect(struct sock *sk) return -ENOBUFS; tcp_init_nondata_skb(buff, tp->write_seq++, TCPHDR_SYN); +#ifdef CONFIG_TCP_STEALTH + /* The timetamp was already made at the time the ISN was generated + * as we need to know its value in the stealth_tcp_sequence_number() + * function. + */ + tp->retrans_stamp = tp->stealth.mstamp.stamp_jiffies; +#else tp->retrans_stamp = tcp_time_stamp; +#endif tcp_connect_queue_skb(sk, buff); tcp_ecn_send_syn(sk, buff); diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 5c8c842..62e167e 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -62,6 +62,7 @@ #include #include #include +#include #include #include @@ -278,6 +279,21 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, sk_set_txhash(sk); +#ifdef CONFIG_TCP_STEALTH + /* If CONFIG_TCP_STEALTH is defined, we need to know the timestamp as + * early as possible and thus move taking the snapshot of tcp_time_stamp + * here. + */ + skb_mstamp_get(&tp->stealth.mstamp); + + if (!tp->write_seq && likely(!tp->repair) && + unlikely(tp->stealth.mode & TCP_STEALTH_MODE_AUTH)) + tp->write_seq = tcp_stealth_sequence_number(sk, + sk->sk_v6_daddr.s6_addr32, + sizeof(sk->sk_v6_daddr), + inet->inet_dport); +#endif + if (!tp->write_seq && likely(!tp->repair)) tp->write_seq = secure_tcpv6_sequence_number(np->saddr.s6_addr32, sk->sk_v6_daddr.s6_addr32, @@ -1183,7 +1199,8 @@ out: static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) { struct ipv6_pinfo *np = inet6_sk(sk); - struct tcp_sock *tp; + struct tcp_sock *tp = tcp_sk(sk); + struct tcphdr *th = tcp_hdr(skb); struct sk_buff *opt_skb = NULL; /* Imagine: socket is IPv6. IPv4 packet arrives, @@ -1243,6 +1260,13 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) if (tcp_checksum_complete(skb)) goto csum_err; +#ifdef CONFIG_TCP_STEALTH + if (sk->sk_state == TCP_LISTEN && th->syn && !th->fin && + tp->stealth.mode & TCP_STEALTH_MODE_AUTH && + tcp_stealth_do_auth(sk, skb)) + goto reset; +#endif + if (sk->sk_state == TCP_LISTEN) { struct sock *nsk = tcp_v6_cookie_check(sk, skb); diff --git a/scripts/mkcompile_h b/scripts/mkcompile_h index 6fdc97e..20739e4 100755 --- a/scripts/mkcompile_h +++ b/scripts/mkcompile_h @@ -54,8 +54,8 @@ else fi UTS_VERSION="#$VERSION" -CONFIG_FLAGS="" -if [ -n "$SMP" ] ; then CONFIG_FLAGS="SMP"; fi +CONFIG_FLAGS="PCK" +if [ -n "$SMP" ] ; then CONFIG_FLAGS="$CONFIG_FLAGS SMP"; fi if [ -n "$PREEMPT" ] ; then CONFIG_FLAGS="$CONFIG_FLAGS PREEMPT"; fi UTS_VERSION="$UTS_VERSION $CONFIG_FLAGS $TIMESTAMP" diff --git b/scripts/tuxonice_output_to_csv.sh b/scripts/tuxonice_output_to_csv.sh new file mode 100644 index 0000000..b96e680 --- /dev/null +++ b/scripts/tuxonice_output_to_csv.sh @@ -0,0 +1,35 @@ +#!/bin/bash + +cat $1 | grep "\*TOI\*" | cut -b 22- | sed "s/ /,/g" | sed "s/\.//" | sort -n > $1.tmp +COLUMNS=$(cat $1.tmp | awk -F ',' ' { print $2 } ' | sort | uniq) +echo -n "pfn," > $1.tmp2 +for NAME in $COLUMNS; do + echo -n "$NAME," >> $1.tmp2 +done +echo >> $1.tmp2 +FIRST=1 +declare -A data +while IFS=, read -r pfn column value; do + if [ $FIRST -eq 1 ]; then + FIRST=0 + LAST_PFN=$pfn + fi + if [ $pfn -ne $LAST_PFN ]; then + echo -n "$LAST_PFN," >> $1.tmp2; + for NAME in $COLUMNS; do + echo -n "${data[$NAME]}," >> $1.tmp2 + done + data=( ) + echo >> $1.tmp2 + LAST_PFN=$pfn + fi + if [ -z "$value" ]; then + data[$column]=X + else + data[$column]=$value + fi +done < $1.tmp +mv $1.tmp2 $1.csv +rm $1.tmp +LIBREOFFICE=$(which libreoffice) +[ -n "$LIBREOFFICE" ] && libreoffice $1.csv & diff --git a/security/keys/big_key.c b/security/keys/big_key.c index 907c152..83dce8a 100644 --- a/security/keys/big_key.c +++ b/security/keys/big_key.c @@ -78,7 +78,7 @@ int big_key_preparse(struct key_preparsed_payload *prep) * * TODO: Encrypt the stored data with a temporary key. */ - file = shmem_kernel_file_setup("", datalen, 0); + file = shmem_kernel_file_setup("", datalen, 0, 0); if (IS_ERR(file)) { ret = PTR_ERR(file); goto error; diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c index db2dd33..65da997 100644 --- a/virt/kvm/async_pf.c +++ b/virt/kvm/async_pf.c @@ -97,8 +97,8 @@ static void async_pf_execute(struct work_struct *work) * This memory barrier pairs with prepare_to_wait's set_current_state() */ smp_mb(); - if (waitqueue_active(&vcpu->wq)) - wake_up_interruptible(&vcpu->wq); + if (swait_active(&vcpu->wq)) + swake_up(&vcpu->wq); mmput(mm); kvm_put_kvm(vcpu->kvm); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 9102ae1..5af50c3 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -216,8 +216,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id) vcpu->kvm = kvm; vcpu->vcpu_id = id; vcpu->pid = NULL; - vcpu->halt_poll_ns = 0; - init_waitqueue_head(&vcpu->wq); + init_swait_queue_head(&vcpu->wq); kvm_async_pf_vcpu_init(vcpu); vcpu->pre_pcpu = -1; @@ -1993,7 +1992,7 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu) void kvm_vcpu_block(struct kvm_vcpu *vcpu) { ktime_t start, cur; - DEFINE_WAIT(wait); + DECLARE_SWAITQUEUE(wait); bool waited = false; u64 block_ns; @@ -2018,7 +2017,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu) kvm_arch_vcpu_blocking(vcpu); for (;;) { - prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE); + prepare_to_swait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE); if (kvm_vcpu_check_block(vcpu) < 0) break; @@ -2027,7 +2026,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu) schedule(); } - finish_wait(&vcpu->wq, &wait); + finish_swait(&vcpu->wq, &wait); cur = ktime_get(); kvm_arch_vcpu_unblocking(vcpu); @@ -2059,11 +2058,11 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu) { int me; int cpu = vcpu->cpu; - wait_queue_head_t *wqp; + struct swait_queue_head *wqp; wqp = kvm_arch_vcpu_wq(vcpu); - if (waitqueue_active(wqp)) { - wake_up_interruptible(wqp); + if (swait_active(wqp)) { + swake_up(wqp); ++vcpu->stat.halt_wakeup; } @@ -2164,7 +2163,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me) continue; if (vcpu == me) continue; - if (waitqueue_active(&vcpu->wq) && !kvm_arch_vcpu_runnable(vcpu)) + if (swait_active(&vcpu->wq) && !kvm_arch_vcpu_runnable(vcpu)) continue; if (!kvm_vcpu_eligible_for_directed_yield(vcpu)) continue;