Age | Commit message (Collapse) | Author |
|
Revert commit 69a37bea (cpuidle: Quickly notice prediction failure for
repeat mode), because it has been identified as the source of a
significant performance regression in v3.8 and later as explained by
Jeremy Eder:
We believe we've identified a particular commit to the cpuidle code
that seems to be impacting performance of variety of workloads.
The simplest way to reproduce is using netperf TCP_RR test, so
we're using that, on a pair of Sandy Bridge based servers. We also
have data from a large database setup where performance is also
measurably/positively impacted, though that test data isn't easily
share-able.
Included below are test results from 3 test kernels:
kernel reverts
-----------------------------------------------------------
1) vanilla upstream (no reverts)
2) perfteam2 reverts e11538d1f03914eb92af5a1a378375c05ae8520c
3) test reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4
e11538d1f03914eb92af5a1a378375c05ae8520c
In summary, netperf TCP_RR numbers improve by approximately 4%
after reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4. When
69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency
never seems to get above 40%. Taking that patch out gets C0 near
100% quite often, and performance increases.
The below data are histograms representing the %c0 residency @
1-second sample rates (using turbostat), while under netperf test.
- If you look at the first 4 histograms, you can see %c0 residency
almost entirely in the 30,40% bin.
- The last pair, which reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4,
shows %c0 in the 80,90,100% bins.
Below each kernel name are netperf TCP_RR trans/s numbers for the
particular kernel that can be disclosed publicly, comparing the 3
test kernels. We ran a 4th test with the vanilla kernel where
we've also set /dev/cpu_dma_latency=0 to show overall impact
boosting single-threaded TCP_RR performance over 11% above
baseline.
3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0):
TCP_RR trans/s 54323.78
-----------------------------------------------------------
3.10-rc2 vanilla RX (no reverts)
TCP_RR trans/s 48192.47
Receiver %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 0]:
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 59]:
***********************************************************
40.0000 - 50.0000 [ 1]: *
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 0]:
Sender %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 0]:
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 11]: ***********
40.0000 - 50.0000 [ 49]:
*************************************************
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 0]:
-----------------------------------------------------------
3.10-rc2 perfteam2 RX (reverts commit
e11538d1f03914eb92af5a1a378375c05ae8520c)
TCP_RR trans/s 49698.69
Receiver %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 1]: *
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 59]:
***********************************************************
40.0000 - 50.0000 [ 0]:
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 0]:
Sender %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 0]:
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 2]: **
40.0000 - 50.0000 [ 58]:
**********************************************************
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 0]:
-----------------------------------------------------------
3.10-rc2 test RX (reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4
and e11538d1f03914eb92af5a1a378375c05ae8520c)
TCP_RR trans/s 47766.95
Receiver %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 1]: *
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 27]: ***************************
40.0000 - 50.0000 [ 2]: **
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 2]: **
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 28]: ****************************
Sender:
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 0]:
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 11]: ***********
40.0000 - 50.0000 [ 0]:
50.0000 - 60.0000 [ 1]: *
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 3]: ***
80.0000 - 90.0000 [ 7]: *******
90.0000 - 100.0000 [ 38]: **************************************
These results demonstrate gaining back the tendency of the CPU to
stay in more responsive, performant C-states (and thus yield
measurably better performance), by reverting commit
69a37beabf1f0a6705c08e879bdd5d82ff6486c4.
Requested-by: Jeremy Eder <jeder@redhat.com>
Tested-by: Len Brown <len.brown@intel.com>
Cc: 3.8+ <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Revert commit e11538d1 (cpuidle: Quickly notice prediction failure in
general case), since it depends on commit 69a37be (cpuidle: Quickly
notice prediction failure for repeat mode) that has been identified
as the source of a significant performance regression in v3.8 and
later.
Requested-by: Jeremy Eder <jeder@redhat.com>
Tested-by: Len Brown <len.brown@intel.com>
Cc: 3.8+ <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI updates from Rafael Wysocki:
"This time the total number of ACPI commits is slightly greater than
the number of cpufreq commits, but Viresh Kumar (who works on cpufreq)
remains the most active patch submitter.
To me, the most significant change is the addition of offline/online
device operations to the driver core (with the Greg's blessing) and
the related modifications of the ACPI core hotplug code. Next are the
freezer updates from Colin Cross that should make the freezing of
tasks a bit less heavy weight.
We also have a couple of regression fixes, a number of fixes for
issues that have not been identified as regressions, two new drivers
and a bunch of cleanups all over.
Highlights:
- Hotplug changes to support graceful hot-removal failures.
It sometimes is necessary to fail device hot-removal operations
gracefully if they cannot be carried out completely. For example,
if memory from a memory module being hot-removed has been allocated
for the kernel's own use and cannot be moved elsewhere, it's
desirable to fail the hot-removal operation in a graceful way
rather than to crash the kernel, but currenty a success or a kernel
crash are the only possible outcomes of an attempted memory
hot-removal. Needless to say, that is not a very attractive
alternative and it had to be addressed.
However, in order to make it work for memory, I first had to make
it work for CPUs and for this purpose I needed to modify the ACPI
processor driver. It's been split into two parts, a resident one
handling the low-level initialization/cleanup and a modular one
playing the actual driver's role (but it binds to the CPU system
device objects rather than to the ACPI device objects representing
processors). That's been sort of like a live brain surgery on a
patient who's riding a bike.
So this is a little scary, but since we found and fixed a couple of
regressions it caused to happen during the early linux-next testing
(a month ago), nobody has complained.
As a bonus we remove some duplicated ACPI hotplug code, because the
ACPI-based CPU hotplug is now going to use the common ACPI hotplug
code.
- Lighter weight freezing of tasks.
These changes from Colin Cross and Mandeep Singh Baines are
targeted at making the freezing of tasks a bit less heavy weight
operation. They reduce the number of tasks woken up every time
during the freezing, by using the observation that the freezer
simply doesn't need to wake up some of them and wait for them all
to call refrigerator(). The time needed for the freezer to decide
to report a failure is reduced too.
Also reintroduced is the check causing a lockdep warining to
trigger when try_to_freeze() is called with locks held (which is
generally unsafe and shouldn't happen).
- cpufreq updates
First off, a commit from Srivatsa S Bhat fixes a resume regression
introduced during the 3.10 cycle causing some cpufreq sysfs
attributes to return wrong values to user space after resume. The
fix is kind of fresh, but also it's pretty obvious once Srivatsa
has identified the root cause.
Second, we have a new freqdomain_cpus sysfs attribute for the
acpi-cpufreq driver to provide information previously available via
related_cpus. From Lan Tianyu.
Finally, we fix a number of issues, mostly related to the
CPUFREQ_POSTCHANGE notifier and cpufreq Kconfig options and clean
up some code. The majority of changes from Viresh Kumar with bits
from Jacob Shin, Heiko Stübner, Xiaoguang Chen, Ezequiel Garcia,
Arnd Bergmann, and Tang Yuantian.
- ACPICA update
A usual bunch of updates from the ACPICA upstream.
During the 3.4 cycle we introduced support for ACPI 5 extended
sleep registers, but they are only supposed to be used if the
HW-reduced mode bit is set in the FADT flags and the code attempted
to use them without checking that bit. That caused suspend/resume
regressions to happen on some systems. Fix from Lv Zheng causes
those registers to be used only if the HW-reduced mode bit is set.
Apart from this some other ACPICA bugs are fixed and code cleanups
are made by Bob Moore, Tomasz Nowicki, Lv Zheng, Chao Guan, and
Zhang Rui.
- cpuidle updates
New driver for Xilinx Zynq processors is added by Michal Simek.
Multidriver support simplification, addition of some missing
kerneldoc comments and Kconfig-related fixes come from Daniel
Lezcano.
- ACPI power management updates
Changes to make suspend/resume work correctly in Xen guests from
Konrad Rzeszutek Wilk, sparse warning fix from Fengguang Wu and
cleanups and fixes of the ACPI device power state selection
routine.
- ACPI documentation updates
Some previously missing pieces of ACPI documentation are added by
Lv Zheng and Aaron Lu (hopefully, that will help people to
uderstand how the ACPI subsystem works) and one outdated doc is
updated by Hanjun Guo.
- Assorted ACPI updates
We finally nailed down the IA-64 issue that was the reason for
reverting commit 9f29ab11ddbf ("ACPI / scan: do not match drivers
against objects having scan handlers"), so we can fix it and move
the ACPI scan handler check added to the ACPI video driver back to
the core.
A mechanism for adding CMOS RTC address space handlers is
introduced by Lan Tianyu to allow some EC-related breakage to be
fixed on some systems.
A spec-compliant implementation of acpi_os_get_timer() is added by
Mika Westerberg.
The evaluation of _STA is added to do_acpi_find_child() to avoid
situations in which a pointer to a disabled device object is
returned instead of an enabled one with the same _ADR value. From
Jeff Wu.
Intel BayTrail PCH (Platform Controller Hub) support is added to
the ACPI driver for Intel Low-Power Subsystems (LPSS) and that
driver is modified to work around a couple of known BIOS issues.
Changes from Mika Westerberg and Heikki Krogerus.
The EC driver is fixed by Vasiliy Kulikov to use get_user() and
put_user() instead of dereferencing user space pointers blindly.
Code cleanups are made by Bjorn Helgaas, Nicholas Mazzuca and Toshi
Kani.
- Assorted power management updates
The "runtime idle" helper routine is changed to take the return
values of the callbacks executed by it into account and to call
rpm_suspend() if they return 0, which allows us to reduce the
overall code bloat a bit (by dropping some code that's not
necessary any more after that modification).
The runtime PM documentation is updated by Alan Stern (to reflect
the "runtime idle" behavior change).
New trace points for PM QoS are added by Sahara
(<keun-o.park@windriver.com>).
PM QoS documentation is updated by Lan Tianyu.
Code cleanups are made and minor issues are addressed by Bernie
Thompson, Bjorn Helgaas, Julius Werner, and Shuah Khan.
- devfreq updates
New driver for the Exynos5-bus device from Abhilash Kesavan.
Minor cleanups, fixes and MAINTAINERS update from MyungJoo Ham,
Abhilash Kesavan, Paul Bolle, Rajagopal Venkat, and Wei Yongjun.
- OMAP power management updates
Adaptive Voltage Scaling (AVS) SmartReflex voltage control driver
updates from Andrii Tseglytskyi and Nishanth Menon."
* tag 'pm+acpi-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (162 commits)
cpufreq: Fix cpufreq regression after suspend/resume
ACPI / PM: Fix possible NULL pointer deref in acpi_pm_device_sleep_state()
PM / Sleep: Warn about system time after resume with pm_trace
cpufreq: don't leave stale policy pointer in cdbs->cur_policy
acpi-cpufreq: Add new sysfs attribute freqdomain_cpus
cpufreq: make sure frequency transitions are serialized
ACPI: implement acpi_os_get_timer() according the spec
ACPI / EC: Add HP Folio 13 to ec_dmi_table in order to skip DSDT scan
ACPI: Add CMOS RTC Operation Region handler support
ACPI / processor: Drop unused variable from processor_perflib.c
cpufreq: tegra: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: s3c64xx: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: omap: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: imx6q: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: exynos: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: dbx500: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: davinci: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: arm-big-little: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: powernow-k8: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: pcc: call CPUFREQ_POSTCHANGE notfier in error cases
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC specific changes from Arnd Bergmann:
"These changes are all to SoC-specific code, a total of 33 branches on
17 platforms were pulled into this. Like last time, Renesas sh-mobile
is now the platform with the most changes, followed by OMAP and
EXYNOS.
Two new platforms, TI Keystone and Rockchips RK3xxx are added in this
branch, both containing almost no platform specific code at all, since
they are using generic subsystem interfaces for clocks, pinctrl,
interrupts etc. The device drivers are getting merged through the
respective subsystem maintainer trees.
One more SoC (u300) is now multiplatform capable and several others
(shmobile, exynos, msm, integrator, kirkwood, clps711x) are moving
towards that goal with this series but need more work.
Also noteworthy is the work on PCI here, which is traditionally part
of the SoC specific code. With the changes done by Thomas Petazzoni,
we can now more easily have PCI host controller drivers as loadable
modules and keep them separate from the platform code in
drivers/pci/host. This has already led to the discovery that three
platforms (exynos, spear and imx) are actually using an identical PCIe
host controller and will be able to share a driver once support for
spear and imx is added."
* tag 'soc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (480 commits)
ARM: integrator: let pciv3 use mem/premem from device tree
ARM: integrator: set local side PCI addresses right
ARM: dts: Add pcie controller node for exynos5440-ssdk5440
ARM: dts: Add pcie controller node for Samsung EXYNOS5440 SoC
ARM: EXYNOS: Enable PCIe support for Exynos5440
pci: Add PCIe driver for Samsung Exynos
ARM: OMAP5: voltagedomain data: remove temporary OMAP4 voltage data
ARM: keystone: Move CPU bringup code to dedicated asm file
ARM: multiplatform: always pick one CPU type
ARM: imx: select syscon for IMX6SL
ARM: keystone: select ARM_ERRATA_798181 only for SMP
ARM: imx: Synertronixx scb9328 needs to select SOC_IMX1
ARM: OMAP2+: AM43x: resolve SMP related build error
dmaengine: edma: enable build for AM33XX
ARM: edma: Add EDMA crossbar event mux support
ARM: edma: Add DT and runtime PM support to the private EDMA API
dmaengine: edma: Add TI EDMA device tree binding
arm: add basic support for Rockchip RK3066a boards
arm: add debug uarts for rockchip rk29xx and rk3xxx series
arm: Add basic clocks for Rockchip rk3066a SoCs
...
|
|
Like other ARM specific drivers, this one requires ARM_CPU_SUSPEND,
as shown by this linker error:
drivers/built-in.o: In function `calxeda_pwrdown_idle':
drivers/cpuidle/cpuidle-calxeda.c:84: undefined reference to `cpu_suspend'
drivers/cpuidle/cpuidle-calxeda.c:86: undefined reference to `cpu_resume'
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: linux-pm@vger.kernel.org
|
|
Before commit d6f346f (cpuidle: improve governor Kconfig options),
the CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED option didn't depend on
CONFIG_CPU_IDLE but now it has been moved under the CPU_IDLE
menuconfig.
That raises the following warnings:
warning: (ARCH_OMAP4 && ARCH_TEGRA_2x_SOC) selects ARCH_NEEDS_CPU_IDLE_COUPLED
which has unmet direct dependencies (CPU_IDLE)
warning: (ARCH_OMAP4 && ARCH_TEGRA_2x_SOC) selects ARCH_NEEDS_CPU_IDLE_COUPLED
which has unmet direct dependencies (CPU_IDLE)
because the tegra2 and omap4 Kconfig files select this option
without checking if CPU_IDLE is set.
Fix that by moving ARCH_NEEDS_CPU_IDLE_COUPLED outside of CPU_IDLE.
[rjw: Changelog]
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Add kerneldoc (and other) comments to the cpuidle driver's framework
code.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Commit bf4d1b5 (cpuidle: support multiple drivers) introduced support
for using multiple cpuidle drivers at the same time. It added a
couple of new APIs to register the driver per CPU, but that led to
some unnecessary code complexity related to the kernel config options
deciding whether or not the multiple driver support is enabled. The
code has to work as it did before when the multiple driver support is
not enabled and the multiple driver support has to be compatible with
the previously existing API.
Remove the new API, not used by any driver in the tree yet (but
needed for the HMP cpuidle drivers that will be submitted soon), and
add a new cpumask pointer to the cpuidle driver structure that will
point to the mask of CPUs handled by the given driver. That will
allow the cpuidle_[un]register_driver() API to be used for the
multiple driver support along with the cpuidle_[un]register()
functions added recently.
[rjw: Changelog]
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Add cpuidle support for Xilinx Zynq.
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Each governor is suitable for different kernel configurations: the menu
governor suits better for a tickless system, while the ladder governor fits
better for a periodic timer tick system.
The Kconfig does not allow to [un]select a governor, thus both are compiled in
the kernel but the init order makes the menu governor to be the last one to be
registered, so becoming the default. The only way to switch back to the ladder
governor is to enable the sysfs governor switch in the kernel command line.
Because it seems nobody complained about this, the menu governor is used by
default most of the time on the system, having both governors is not really
necessary on a tickless system but there isn't a config option to disable one
or another governor.
Create a submenu for cpuidle and add a label for each governor, so we can see
the option in the menu config and enable/disable it.
The governors will be enabled depending on the CONFIG_NO_HZ option:
- If CONFIG_NO_HZ is set, then the menu governor is selected and the ladder
governor is optional, defaulting to 'yes'
- If CONFIG_NO_HZ is not set, then the ladder governor is selected and the
menu governor is optional, defaulting to 'yes'
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Move the private set_auxcr/get_auxcr functions from
drivers/cpuidle/cpuidle-calxeda.c so they can be used across platforms.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Tony Lindgren <tony@atomide.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
|
|
Currently cpuidle drivers are spread across different archs.
As a result, there are several different paths for cpuidle patch
submissions: cpuidle core changes go through linux-pm, ARM driver
changes go to the arm-soc or SoC-specific trees, sh changes go
through the sh arch tree, pseries changes go through the PowerPC tree
and finally intel changes go through the Len's tree while ACPI idle
changes go through linux-pm.
That makes it difficult to consolidate code and to propagate
modifications from the cpuidle core to the different drivers.
Hopefully, a movement has started to put the majority of cpuidle
drivers under drivers/cpuidle like cpuidle-calxeda.c and
cpuidle-kirkwood.c.
Add a maintainer entry for cpuidle to MAINTAINERS to clarify the
situation and to indicate to new cpuidle driver authors that those
drivers should not go into arch-specific directories.
The upstreaming process is unchanged: Rafael takes patches for
merging into his tree, but with an Acked-by: tag from the driver's
maintainer, so indicate in the drivers' headers who maintains them.
The arrangement will be the same as for cpufreq.
[rjw: Changelog]
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Andrew Lunn <andrew@lunn.ch> #for kirkwood
Acked-by: Jason Cooper <jason@lakedaemon.net> #for kirkwood
Acked-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Fix comment format for the kernel doc script.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Remove the duplicated code and use the cpuidle common code for initialization.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Remove the duplicated code and use the cpuidle common code for initialization.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The usual scheme to initialize a cpuidle driver on a SMP is:
cpuidle_register_driver(drv);
for_each_possible_cpu(cpu) {
device = &per_cpu(cpuidle_dev, cpu);
cpuidle_register_device(device);
}
This code is duplicated in each cpuidle driver.
On UP systems, it is done this way:
cpuidle_register_driver(drv);
device = &per_cpu(cpuidle_dev, cpu);
cpuidle_register_device(device);
On UP, the macro 'for_each_cpu' does one iteration:
#define for_each_cpu(cpu, mask) \
for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask)
Hence, the initialization loop is the same for UP than SMP.
Beside, we saw different bugs / mis-initialization / return code unchecked in
the different drivers, the code is duplicated including bugs. After fixing all
these ones, it appears the initialization pattern is the same for everyone.
Please note, some drivers are doing dev->state_count = drv->state_count. This is
not necessary because it is done by the cpuidle_enable_device function in the
cpuidle framework. This is true, until you have the same states for all your
devices. Otherwise, the 'low level' API should be used instead with the specific
initialization for the driver.
Let's add a wrapper function doing this initialization with a cpumask parameter
for the coupled idle states and use it for all the drivers.
That will save a lot of LOC, consolidate the code, and the modifications in the
future could be done in a single place. Another benefit is the consolidation of
the cpuidle_device variable which is now in the cpuidle framework and no longer
spread accross the different arch specific drivers.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The en_core_tk_irqen flag is set in all the cpuidle driver which
means it is not necessary to specify this flag.
Remove the flag and the code related to it.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Kevin Hilman <khilman@linaro.org> # for mach-omap2/*
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The commit 89878baa73f0f1c679355006bd8632e5d78f96c2 introduced
the CPUIDLE_FLAG_TIMER_STOP flag where we specify a specific idle
state stops the local timer.
Now use this flag to check at init time if one state will need
the broadcast timer and, in this case, setup the broadcast timer
framework. That prevents multiple code duplication in the drivers.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Convert all uses of devm_request_and_ioremap() to the newly introduced
devm_ioremap_resource() which provides more consistent error handling.
devm_ioremap_resource() provides its own error messages so all explicit
error messages can be removed from the failure code paths.
Signed-off-by: Silviu-Mihai Popescu <silviupopescu1990@gmail.com>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
When the CPU_IDLE and the ARCH_KIRKWOOD options are set it is
pointless to define a new option CPU_IDLE_KIRKWOOD because it
is redundant.
The Makefile drivers directory contains a condition to compile
the cpuidle drivers:
obj-$(CONFIG_CPU_IDLE) += cpuidle/
Hence, if CPU_IDLE is not set we won't enter this directory.
This patch removes the useless Kconfig option and replaces the
condition in the Makefile by CONFIG_ARCH_KIRKWOOD.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Jason Cooper <jason@lakedaemon.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
When a cpu enters a deep idle state, the local timers are stopped and
the time framework falls back to the timer device used as a broadcast
timer.
The different cpuidle drivers are calling clockevents_notify ENTER/EXIT
when the idle state stops the local timer.
Add a new flag CPUIDLE_FLAG_TIMER_STOP which can be set by the cpuidle
drivers. If the flag is set, the cpuidle core code takes care of the
notification on behalf of the driver to avoid pointless code duplication.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Pull ARM SoC-specific updates from Arnd Bergmann:
"This is a larger set of new functionality for the existing SoC
families, including:
- vt8500 gains support for new CPU cores, notably the Cortex-A9 based
wm8850
- prima2 gains support for the "marco" SoC family, its SMP based
cousin
- tegra gains support for the new Tegra4 (Tegra114) family
- socfpga now supports a newer version of the hardware including SMP
- i.mx31 and bcm2835 are now using DT probing for their clocks
- lots of updates for sh-mobile
- OMAP updates for clocks, power management and USB
- i.mx6q and tegra now support cpuidle
- kirkwood now supports PCIe hot plugging
- tegra clock support is updated
- tegra USB PHY probing gets implemented diffently"
* tag 'soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (148 commits)
ARM: prima2: remove duplicate v7_invalidate_l1
ARM: shmobile: r8a7779: Correct TMU clock support again
ARM: prima2: fix __init section for cpu hotplug
ARM: OMAP: Consolidate OMAP USB-HS platform data (part 3/3)
ARM: OMAP: Consolidate OMAP USB-HS platform data (part 1/3)
arm: socfpga: Add SMP support for actual socfpga harware
arm: Add v7_invalidate_l1 to cache-v7.S
arm: socfpga: Add entries to enable make dtbs socfpga
arm: socfpga: Add new device tree source for actual socfpga HW
ARM: tegra: sort Kconfig selects for Tegra114
ARM: tegra: enable ARCH_REQUIRE_GPIOLIB for Tegra114
ARM: tegra: Fix build error w/ ARCH_TEGRA_114_SOC w/o ARCH_TEGRA_3x_SOC
ARM: tegra: Fix build error for gic update
ARM: tegra: remove empty tegra_smp_init_cpus()
ARM: shmobile: Register ARM architected timer
ARM: MARCO: fix the build issue due to gic-vic-to-irqchip move
ARM: shmobile: r8a7779: Correct TMU clock support
ARM: mxs_defconfig: Select CONFIG_DEVTMPFS_MOUNT
ARM: mxs: decrease mxs_clockevent_device.min_delta_ns to 2 clock cycles
ARM: mxs: use apbx bus clock to drive the timers on timrotv2
...
|
|
Move the Kirkwood cpuidle driver out of arch/arm/mach-kirkwood and
into drivers/cpuidle. Convert the driver into a platform driver.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jason Cooper <jason@lakedaemon.net>
|
|
The text in Documentation said it would be removed in 2.6.41;
the text in the Kconfig said removal in the 3.1 release. Either
way you look at it, we are well past both, so push it off a cliff.
Note that the POWER_CSTATE and the POWER_PSTATE are part of the
legacy tracing API. Remove all tracepoints which use these flags.
As can be seen from context, most already have a trace entry via
trace_cpu_idle anyways.
Also, the cpufreq/cpufreq.c PSTATE one is actually unpaired, as
compared to the CSTATE ones which all have a clear start/stop.
As part of this, the trace_power_frequency also becomes orphaned,
so it too is deleted.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
We realized that the power usage field is never filled and when it
is filled for tegra, the power_specified flag is not set causing all
of these values to be reset when the driver is initialized with
set_power_state().
However, the power_specified flag can be simply removed under the
assumption that the states are always backward sorted, which is the
case with the current code.
This change allows the menu governor select function and the
cpuidle_play_dead() to be simplified. Moreover, the
set_power_states() function can removed as it does not make sense
any more.
Drop the power_specified flag from struct cpuidle_driver and make
the related changes as described above.
As a consequence, this also fixes the bug where on the dynamic
C-states system, the power fields are not initialized.
[rjw: Changelog]
References: https://bugzilla.kernel.org/show_bug.cgi?id=42870
References: https://bugzilla.kernel.org/show_bug.cgi?id=43349
References: https://lkml.org/lkml/2012/10/16/518
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Commit bf4d1b5ddb78f86078ac6ae0415802d5f0c68f92 (cpuidle: support
multiple drivers) changed the number of initialized state kobjects
in cpuidle_add_state_sysfs() from device->state_count to
drv->state_count, but left device->state_count in
cpuidle_remove_state_sysfs(). The values of these two fields may be
different, in which case a NULL pointer dereference may happen in
cpuidle_remove_state_sysfs(), for example. Fix this problem by making
cpuidle_add_state_sysfs() use device->state_count too (which restores
the original behavior of it).
[rjw: Changelog]
Signed-off-by: Krzysztof Mazur <krzysiek@podlesie.net>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Commit bf4d1b5 (cpuidle: support multiple drivers) introduced
locking in cpuidle_get_cpu_driver(), which is used in the
idle_call() function.
This leads to a contention problem with a large number of CPUs,
because they all try to run the idle routine at the same time.
The lock can be safely removed because of how is used the cpuidle
API. Namely, cpuidle_register_driver() is called first, but the
cpuidle idle function is not entered before cpuidle_register_device()
is called, because the cpuidle device is not enabled then. Moreover,
cpuidle_unregister_driver(), which would reset the driver value to
NULL, is not called before cpuidle_unregister_device().
All of the cpuidle drivers use the API in the same way.
In general, a cleanup around the lock is necessary and a proper
refcounting mechanism should be used to ensure the consistency in the
API (for example, cpuidle_unregister_driver() should fail if the
driver's refcount is not 0). However, these modifications will require
some code reorganization and rewrite which will be too intrusive for
a fix.
For this reason, fix the contention problem introduced by commit
bf4d1b5 by simply removing the locking from cpuidle_get_cpu_driver(),
which restores the original behavior of that routine.
[rjw: Changelog.]
Reported-and-tested-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The ready_waiting_counts atomic variable is compared against the wrong
online cpu count. The latter is computed incorrectly using logical-OR
instead of bit-OR. This patch fixes that.
Signed-off-by: Sivaram Nair <sivaramn@nvidia.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Acked-by: Colin Cross <ccross@android.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Since cpuidle_state.power_usage is a signed value, use INT_MAX (instead
of -1) to init the local copies so that functions that tries to find
cpuidle states with minimum power usage works correctly even if they use
non-negative values.
Signed-off-by: Sivaram Nair <sivaramn@nvidia.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Pull ARM SoC updates from Olof Johansson:
"This contains the bulk of new SoC development for this merge window.
Two new platforms have been added, the sunxi platforms (Allwinner A1x
SoCs) by Maxime Ripard, and a generic Broadcom platform for a new
series of ARMv7 platforms from them, where the hope is that we can
keep the platform code generic enough to have them all share one mach
directory. The new Broadcom platform is contributed by Christian
Daudt.
Highbank has grown support for Calxeda's next generation of hardware,
ECX-2000.
clps711x has seen a lot of cleanup from Alexander Shiyan, and he's
also taken on maintainership of the platform.
Beyond this there has been a bunch of work from a number of people on
converting more platforms to IRQ domains, pinctrl conversion, cleanup
and general feature enablement across most of the active platforms."
Fix up trivial conflicts as per Olof.
* tag 'soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (174 commits)
mfd: vexpress-sysreg: Remove LEDs code
irqchip: irq-sunxi: Add terminating entry for sunxi_irq_dt_ids
clocksource: sunxi_timer: Add terminating entry for sunxi_timer_dt_ids
irq: versatile: delete dangling variable
ARM: sunxi: add missing include for mdelay()
ARM: EXYNOS: Avoid early use of of_machine_is_compatible()
ARM: dts: add node for PL330 MDMA1 controller for exynos4
ARM: EXYNOS: Add support for secondary CPU bring-up on Exynos4412
ARM: EXYNOS: add UART3 to DEBUG_LL ports
ARM: S3C24XX: Add clkdev entry for camif-upll clock
ARM: SAMSUNG: Add s3c24xx/s3c64xx CAMIF GPIO setup helpers
ARM: sunxi: Add missing sun4i.dtsi file
pinctrl: samsung: Do not initialise statics to 0
ARM i.MX6: remove gate_mask from pllv3
ARM i.MX6: Fix ethernet PLL clocks
ARM i.MX6: rename PLLs according to datasheet
ARM i.MX6: Add pwm support
ARM i.MX51: Add pwm support
ARM i.MX53: Add pwm support
ARM: mx5: Replace clk_register_clkdev with clock DT lookup
...
|
|
Many cpuidle drivers measure their time spent in an idle state by
reading the wallclock time before and after idling and calculating the
difference. This leads to erroneous results when the wallclock time gets
updated by another processor in the meantime, adding that clock
adjustment to the idle state's time counter.
If the clock adjustment was negative, the result is even worse due to an
erroneous cast from int to unsigned long long of the last_residency
variable. The negative 32 bit integer will zero-extend and result in a
forward time jump of roughly four billion milliseconds or 1.3 hours on
the idle state residency counter.
This patch changes all affected cpuidle drivers to either use the
monotonic clock for their measurements or make use of the generic time
measurement wrapper in cpuidle.c, which was already working correctly.
Some superfluous CLIs/STIs in the ACPI code are removed (interrupts
should always already be disabled before entering the idle function, and
not get reenabled until the generic wrapper has performed its second
measurement). It also removes the erroneous cast, making sure that
negative residency values are applied correctly even though they should
not appear anymore.
Signed-off-by: Julius Werner <jwerner@chromium.org>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Tested-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
I saw this suspicious RCU usage on the next tree of 11/15
[ 67.123404] ===============================
[ 67.123413] [ INFO: suspicious RCU usage. ]
[ 67.123423] 3.7.0-rc5-next-20121115-dirty #1 Not tainted
[ 67.123434] -------------------------------
[ 67.123444] include/trace/events/timer.h:186 suspicious rcu_dereference_check() usage!
[ 67.123458]
[ 67.123458] other info that might help us debug this:
[ 67.123458]
[ 67.123474]
[ 67.123474] RCU used illegally from idle CPU!
[ 67.123474] rcu_scheduler_active = 1, debug_locks = 0
[ 67.123493] RCU used illegally from extended quiescent state!
[ 67.123507] 1 lock held by swapper/1/0:
[ 67.123516] #0: (&cpu_base->lock){-.-...}, at: [<c0000000000979b0>] .__hrtimer_start_range_ns+0x28c/0x524
[ 67.123555]
[ 67.123555] stack backtrace:
[ 67.123566] Call Trace:
[ 67.123576] [c0000001e2ccb920] [c00000000001275c] .show_stack+0x78/0x184 (unreliable)
[ 67.123599] [c0000001e2ccb9d0] [c0000000000c15a0] .lockdep_rcu_suspicious+0x120/0x148
[ 67.123619] [c0000001e2ccba70] [c00000000009601c] .enqueue_hrtimer+0x1c0/0x1c8
[ 67.123639] [c0000001e2ccbb00] [c000000000097aa0] .__hrtimer_start_range_ns+0x37c/0x524
[ 67.123660] [c0000001e2ccbc20] [c0000000005c9698] .menu_select+0x508/0x5bc
[ 67.123678] [c0000001e2ccbd20] [c0000000005c740c] .cpuidle_idle_call+0xa8/0x6e4
[ 67.123699] [c0000001e2ccbdd0] [c0000000000459a0] .pSeries_idle+0x10/0x34
[ 67.123717] [c0000001e2ccbe40] [c000000000014dc8] .cpu_idle+0x130/0x280
[ 67.123738] [c0000001e2ccbee0] [c0000000006ffa8c] .start_secondary+0x378/0x384
[ 67.123758] [c0000001e2ccbf90] [c00000000000936c] .start_secondary_prolog+0x10/0x14
hrtimer_start was added in 198fd638 and ae515197. The patch below tries
to use RCU_NONIDLE around it to avoid the above report.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
With the tegra3 and the big.LITTLE [1] new architectures, several cpus
with different characteristics (latencies and states) can co-exists on the
system.
The cpuidle framework has the limitation of handling only identical cpus.
This patch removes this limitation by introducing the multiple driver support
for cpuidle.
This option is configurable at compile time and should be enabled for the
architectures mentioned above. So there is no impact for the other platforms
if the option is disabled. The option defaults to 'n'. Note the multiple drivers
support is also compatible with the existing drivers, even if just one driver is
needed, all the cpu will be tied to this driver using an extra small chunk of
processor memory.
The multiple driver support use a per-cpu driver pointer instead of a global
variable and the accessor to this variable are done from a cpu context.
In order to keep the compatibility with the existing drivers, the function
'cpuidle_register_driver' and 'cpuidle_unregister_driver' will register
the specified driver for all the cpus.
The semantic for the output of /sys/devices/system/cpu/cpuidle/current_driver
remains the same except the driver name will be related to the current cpu.
The /sys/devices/system/cpu/cpu[0-9]/cpuidle/driver/name files are added
allowing to read the per cpu driver name.
[1] http://lwn.net/Articles/481055/
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Peter De Schrijver <pdeschrijver@nvidia.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
This patch is a preparation for the multiple cpuidle drivers support.
As the next patch will introduce the multiple drivers with the Kconfig
option and we want to keep the code clean and understandable, this patch
defines a set of functions for encapsulating some common parts and splits
what should be done under a lock from the rest.
[rjw: Modified the subject and changelog slightly.]
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Peter De Schrijver <pdeschrijver@nvidia.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The code is racy and the check with cpuidle_curr_driver should be
done under the lock.
I don't find a path in the different drivers where that could happen
because the arch specific drivers are written in such way it is not
possible to register a driver while it is unregistered, except maybe
in a very improbable case when "intel_idle" and "processor_idle" are
competing. One could unregister a driver, while the other one is
registering.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Peter De Schrijver <pdeschrijver@nvidia.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
We want to support different cpuidle drivers co-existing together.
In this case we should move the refcount to the cpuidle_driver
structure to handle several drivers at a time.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Peter De Schrijver <pdeschrijver@nvidia.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The "struct device" is only used in sysfs.c.
The other .c files including the private header "cpuidle.h"
do not need to pull the entire headers tree from there as they
don't manipulate the "struct device".
This patch fixes this by moving the header inclusion to sysfs.c
and adding a forward declaration for the struct device.
The number of lines generated by the preprocesor:
Without this patch : 17269 loc
With this patch : 16446 loc
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The structure cpuidle_state_kobj is not used anywhere except
in the sysfs.c file. The definition of this structure is not
needed in the cpuidle header file. This patch moves it to the
sysfs.c file in order to encapsulate the code a bit more.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The function detect_repeating_patterns was not very useful for
workloads with alternating long and short pauses, for example
virtual machines handling network requests for each other (say
a web and database server).
Instead, try to find a recent sleep interval that is somewhere
between the median and the mode sleep time, by discarding outliers
to the up side and recalculating the average and standard deviation
until that is no longer required.
This should do something sane with a sleep interval series like:
200 180 210 10000 30 1000 170 200
The current code would simply discard such a series, while the
new code will guess a typical sleep interval just shy of 200.
The original patch come from Rik van Riel <riel@redhat.com>.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
When cpuidle governor choose a C-state to enter for idle CPU, but it notice that
there is tasks request to be executed. So the idle CPU will not really enter
the target C-state and go to run task.
In this situation, it will use the residency of previous really entered target
C-states. Obviously, it is not reasonable.
So, this patch fix it by set the target C-state residency to 0.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The prediction for future is difficult and when the cpuidle governor prediction
fails and govenor possibly choose the shallower C-state than it should. How to
quickly notice and find the failure becomes important for power saving.
The patch extends to general case that prediction logic get a small predicted
residency, so it choose a shallow C-state though the expected residency is large
. Once the prediction will be fail, the CPU will keep staying at shallow C-state
for a long time. Acutally, the CPU has change enter into deep C-state.
So when the expected residency is long enough but governor choose a shallow
C-state, an timer will be added in order to monitor if the prediction failure.
When C-state is waken up prior to the adding timer, the timer will be cancelled
initiatively. When the timer is triggered and menu governor will quickly notice
prediction failure and re-evaluates deeper C-states possibility.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The prediction for future is difficult and when the cpuidle governor prediction
fails and govenor possibly choose the shallower C-state than it should. How to
quickly notice and find the failure becomes important for power saving.
cpuidle menu governor has a method to predict the repeat pattern if there are 8
C-states residency which are continuous and the same or very close, so it will
predict the next C-states residency will keep same residency time.
There is a real case that turbostat utility (tools/power/x86/turbostat)
at kernel 3.3 or early. turbostat utility will read 10 registers one by one at
Sandybridge, so it will generate 10 IPIs to wake up idle CPUs. So cpuidle menu
governor will predict it is repeat mode and there is another IPI wake up idle
CPU soon, so it keeps idle CPU stay at C1 state even though CPU is totally
idle. However, in the turbostat, following 10 registers reading is sleep 5
seconds by default, so the idle CPU will keep at C1 for a long time though it is
idle until break event occurs.
In a idle Sandybridge system, run "./turbostat -v", we will notice that deep
C-state dangles between "70% ~ 99%". After patched the kernel, we will notice
deep C-state stays at >99.98%.
In the patch, a timer is added when menu governor detects a repeat mode and
choose a shallow C-state. The timer is set to a time out value that greater
than predicted time, and we conclude repeat mode prediction failure if timer is
triggered. When repeat mode happens as expected, the timer is not triggered
and CPU waken up from C-states and it will cancel the timer initiatively.
When repeat mode does not happen, the timer will be time out and menu governor
will quickly notice that the repeat mode prediction fails and then re-evaluates
deeper C-states possibility.
Below is another case which will clearly show the patch much benefit:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/time.h>
#include <time.h>
#include <pthread.h>
volatile int * shutdown;
volatile long * count;
int delay = 20;
int loop = 8;
void usage(void)
{
fprintf(stderr,
"Usage: idle_predict [options]\n"
" --help -h Print this help\n"
" --thread -n Thread number\n"
" --loop -l Loop times in shallow Cstate\n"
" --delay -t Sleep time (uS)in shallow Cstate\n");
}
void *simple_loop() {
int idle_num = 1;
while (!(*shutdown)) {
*count = *count + 1;
if (idle_num % loop)
usleep(delay);
else {
/* sleep 1 second */
usleep(1000000);
idle_num = 0;
}
idle_num++;
}
}
static void sighand(int sig)
{
*shutdown = 1;
}
int main(int argc, char *argv[])
{
sigset_t sigset;
int signum = SIGALRM;
int i, c, er = 0, thread_num = 8;
pthread_t pt[1024];
static char optstr[] = "n:l:t:h:";
while ((c = getopt(argc, argv, optstr)) != EOF)
switch (c) {
case 'n':
thread_num = atoi(optarg);
break;
case 'l':
loop = atoi(optarg);
break;
case 't':
delay = atoi(optarg);
break;
case 'h':
default:
usage();
exit(1);
}
printf("thread=%d,loop=%d,delay=%d\n",thread_num,loop,delay);
count = malloc(sizeof(long));
shutdown = malloc(sizeof(int));
*count = 0;
*shutdown = 0;
sigemptyset(&sigset);
sigaddset(&sigset, signum);
sigprocmask (SIG_BLOCK, &sigset, NULL);
signal(SIGINT, sighand);
signal(SIGTERM, sighand);
for(i = 0; i < thread_num ; i++)
pthread_create(&pt[i], NULL, simple_loop, NULL);
for (i = 0; i < thread_num; i++)
pthread_join(pt[i], NULL);
exit(0);
}
Get powertop V2 from git://github.com/fenrus75/powertop, build powertop.
After build the above test application, then run it.
Test plaform can be Intel Sandybridge or other recent platforms.
#./idle_predict -l 10 &
#./powertop
We will find that deep C-state will dangle between 40%~100% and much time spent
on C1 state. It is because menu governor wrongly predict that repeat mode
is kept, so it will choose the C1 shallow C-state even though it has chance to
sleep 1 second in deep C-state.
While after patched the kernel, we find that deep C-state will keep >99.6%.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Move the kobj initialization and completion in the sysfs.c
and encapsulate the code more.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The function needs the cpuidle_device which is initially passed to the
caller.
The current code gets the struct device from the struct cpuidle_device,
pass it the cpuidle_add_sysfs function. This function calls
per_cpu(cpuidle_devices, cpu) to get the cpuidle_device.
This patch pass the cpuidle_device instead and simplify the code.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Add support for core powergating on Calxeda platforms. Initially, this
supports ECX-1000 (highbank), but support will be added for ECX-2000
later.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Cc: Len Brown <len.brown@intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
|
|
On a KVM guest, when a CPU is taken offline and brought back online, we hit
the following NULL pointer dereference:
[ 45.400843] Unregister pv shared memory for cpu 1
[ 45.412331] smpboot: CPU 1 is now offline
[ 45.529894] SMP alternatives: lockdep: fixing up alternatives
[ 45.533472] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 45.411526] kvm-clock: cpu 1, msr 0:7d14601, secondary cpu clock
[ 45.571370] KVM setup async PF for cpu 1
[ 45.572331] kvm-stealtime: cpu 1, msr 7d0e040
[ 45.575031] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 45.576017] IP: [<ffffffff81519f98>] cpuidle_disable_device+0x18/0x80
[ 45.576017] PGD 5dfb067 PUD 5da8067 PMD 0
[ 45.576017] Oops: 0000 [#1] SMP
[ 45.576017] Modules linked in:
[ 45.576017] CPU 0
[ 45.576017] Pid: 607, comm: stress_cpu_hotp Not tainted 3.6.0-padata-tp-debug #3 Bochs Bochs
[ 45.576017] RIP: 0010:[<ffffffff81519f98>] [<ffffffff81519f98>] cpuidle_disable_device+0x18/0x80
[ 45.576017] RSP: 0018:ffff880005d93ce8 EFLAGS: 00010286
[ 45.576017] RAX: ffff880005d93fd8 RBX: 0000000000000000 RCX: 0000000000000006
[ 45.576017] RDX: 0000000000000006 RSI: 2222222222222222 RDI: 0000000000000000
[ 45.576017] RBP: ffff880005d93cf8 R08: 2222222222222222 R09: 2222222222222222
[ 45.576017] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 45.576017] R13: 0000000000000000 R14: ffffffff81c8cca0 R15: 0000000000000001
[ 45.576017] FS: 00007f91936ae700(0000) GS:ffff880007c00000(0000) knlGS:0000000000000000
[ 45.576017] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 45.576017] CR2: 0000000000000000 CR3: 0000000005db3000 CR4: 00000000000006f0
[ 45.576017] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 45.576017] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 45.576017] Process stress_cpu_hotp (pid: 607, threadinfo ffff880005d92000, task ffff8800066bbf40)
[ 45.576017] Stack:
[ 45.576017] ffff880007a96400 0000000000000000 ffff880005d93d28 ffffffff813ac689
[ 45.576017] ffff880007a96400 ffff880007a96400 0000000000000002 ffffffff81cd8d01
[ 45.576017] ffff880005d93d58 ffffffff813aa498 0000000000000001 00000000ffffffdd
[ 45.576017] Call Trace:
[ 45.576017] [<ffffffff813ac689>] acpi_processor_hotplug+0x55/0x97
[ 45.576017] [<ffffffff813aa498>] acpi_cpu_soft_notify+0x93/0xce
[ 45.576017] [<ffffffff816ae47d>] notifier_call_chain+0x5d/0x110
[ 45.576017] [<ffffffff8109730e>] __raw_notifier_call_chain+0xe/0x10
[ 45.576017] [<ffffffff81069050>] __cpu_notify+0x20/0x40
[ 45.576017] [<ffffffff81069085>] cpu_notify+0x15/0x20
[ 45.576017] [<ffffffff816978f1>] _cpu_up+0xee/0x137
[ 45.576017] [<ffffffff81697983>] cpu_up+0x49/0x59
[ 45.576017] [<ffffffff8168758d>] store_online+0x9d/0xe0
[ 45.576017] [<ffffffff8140a9f8>] dev_attr_store+0x18/0x30
[ 45.576017] [<ffffffff812322c0>] sysfs_write_file+0xe0/0x150
[ 45.576017] [<ffffffff811b389c>] vfs_write+0xac/0x180
[ 45.576017] [<ffffffff811b3be2>] sys_write+0x52/0xa0
[ 45.576017] [<ffffffff816b31e9>] system_call_fastpath+0x16/0x1b
[ 45.576017] Code: 48 c7 c7 40 e5 ca 81 e8 07 d0 18 00 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 10 48 89 5d f0 4c 89 65 f8 48 89 fb <f6> 07 02 75 13 48 8b 5d f0 4c 8b 65 f8 c9 c3 66 0f 1f 84 00 00
[ 45.576017] RIP [<ffffffff81519f98>] cpuidle_disable_device+0x18/0x80
[ 45.576017] RSP <ffff880005d93ce8>
[ 45.576017] CR2: 0000000000000000
[ 45.656079] ---[ end trace 433d6c9ac0b02cef ]---
Analysis:
Commit 3d339dc (cpuidle / ACPI : move cpuidle_device field out of the
acpi_processor_power structure()) made the allocation of the dev structure
(struct cpuidle) of a CPU dynamic, whereas previously it was statically
allocated. And this dynamic allocation occurs in acpi_processor_power_init()
if pr->flags.power evaluates to non-zero.
On KVM guests, pr->flags.power evaluates to zero, hence dev is never
allocated. This causes the NULL pointer (dev) dereference in
cpuidle_disable_device() during a subsequent CPU online operation. Fix this
by ensuring that dev is non-NULL before dereferencing.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
The function __cpuidle_register_driver name is confusing because it
suggests, conforming to the coding style of the kernel, it registers
the driver without taking a lock. Actually, it just fill the different
power field states with a decresing value if the power has not been
specified.
Clarify the purpose of the function by changing its name and
move the condition out of this function.
This patch fix nothing and does not change the behavior of the
function. It is just for the sake of clarity.
IHMO, reading in the code:
+ if (!drv->power_specified)
+ set_power_states(drv);
is much more explicit than:
- __cpuidle_register_driver(drv);
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
|
|
This mindless patch is just about removing some trailing
carriage returns.
[rjw: Changed the subject.]
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
|
|
For the mechanism introduced by commit cbc9ef0 (PM / Domains: Add
preliminary support for cpuidle, v2) to work with the ladder
governor, that governor should respect the "disabled" state flag
added by that commit. Change the ladder governor accordingly.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
|
|
There are two cpuidle governors ladder and menu. While the ladder
governor is always available, if CONFIG_CPU_IDLE is selected, the
menu governor additionally requires CONFIG_NO_HZ.
A particular C state can be disabled by writing to the sysfs file
/sys/devices/system/cpu/cpuN/cpuidle/stateN/disable, but this mechanism
is only implemented in the menu governor. Thus, in a system where
CONFIG_NO_HZ is not selected, the ladder governor becomes default and
always will walk through all sleep states - irrespective of whether the
C state was disabled via sysfs or not. The only way to select a specific
C state was to write the related latency to /dev/cpu_dma_latency and
keep the file open as long as this setting was required - not very
practical and not suitable for setting a single core in an SMP system.
With this patch, the ladder governor only will promote to the next
C state, if it has not been disabled, and it will demote, if the
current C state was disabled.
Note that the patch does not make the setting of the sysfs variable
"disable" coherent, i.e. if one is disabling a light state, then all
deeper states are disabled as well, but the "disable" variable does not
reflect it. Likewise, if one enables a deep state but a lighter state
still is disabled, then this has no effect. A related section has been
added to the documentation.
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
|