summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2012-10-10MODSIGN: Implement module signature checkingDavid Howells
Check the signature on the module against the keys compiled into the kernel or available in a hardware key store. Currently, only RSA keys are supported - though that's easy enough to change, and the signature is expected to contain raw components (so not a PGP or PKCS#7 formatted blob). The signature blob is expected to consist of the following pieces in order: (1) The binary identifier for the key. This is expected to match the SubjectKeyIdentifier from an X.509 certificate. Only X.509 type identifiers are currently supported. (2) The signature data, consisting of a series of MPIs in which each is in the format of a 2-byte BE word sizes followed by the content data. (3) A 12 byte information block of the form: struct module_signature { enum pkey_algo algo : 8; enum pkey_hash_algo hash : 8; enum pkey_id_type id_type : 8; u8 __pad; __be32 id_length; __be32 sig_length; }; The three enums are defined in crypto/public_key.h. 'algo' contains the public-key algorithm identifier (0->DSA, 1->RSA). 'hash' contains the digest algorithm identifier (0->MD4, 1->MD5, 2->SHA1, etc.). 'id_type' contains the public-key identifier type (0->PGP, 1->X.509). '__pad' should be 0. 'id_length' should contain in the binary identifier length in BE form. 'sig_length' should contain in the signature data length in BE form. The lengths are in BE order rather than CPU order to make dealing with cross-compilation easier. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (minor Kconfig fix)
2012-10-10MODSIGN: Provide module signing public keys to the kernelDavid Howells
Include a PGP keyring containing the public keys required to perform module verification in the kernel image during build and create a special keyring during boot which is then populated with keys of crypto type holding the public keys found in the PGP keyring. These can be seen by root: [root@andromeda ~]# cat /proc/keys 07ad4ee0 I----- 1 perm 3f010000 0 0 crypto modsign.0: RSA 87b9b3bd [] 15c7f8c3 I----- 1 perm 1f030000 0 0 keyring .module_sign: 1/4 ... It is probably worth permitting root to invalidate these keys, resulting in their removal and preventing further modules from being loaded with that key. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-10MODSIGN: Automatically generate module signing keys if missingDavid Howells
Automatically generate keys for module signing if they're absent so that allyesconfig doesn't break. The builder should consider generating their own key and certificate, however, so that the keys are appropriately named. The private key for the module signer should be placed in signing_key.priv (unencrypted!) and the public key in an X.509 certificate as signing_key.x509. If a transient key is desired for signing the modules, a config file for 'openssl req' can be placed in x509.genkey, looking something like the following: [ req ] default_bits = 4096 distinguished_name = req_distinguished_name prompt = no x509_extensions = myexts [ req_distinguished_name ] O = Magarathea CN = Glacier signing key emailAddress = slartibartfast@magrathea.h2g2 [ myexts ] basicConstraints=critical,CA:FALSE keyUsage=digitalSignature subjectKeyIdentifier=hash authorityKeyIdentifier=hash The build process will use this to configure: openssl req -new -nodes -utf8 -sha1 -days 36500 -batch \ -x509 -config x509.genkey \ -outform DER -out signing_key.x509 \ -keyout signing_key.priv to generate the key. Note that it is required that the X.509 certificate have a subjectKeyIdentifier and an authorityKeyIdentifier. Without those, the certificate will be rejected. These can be used to check the validity of a certificate. Note that 'make distclean' will remove signing_key.{priv,x509} and x509.genkey, whether or not they were generated automatically. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-10MODSIGN: Provide Kconfig optionsDavid Howells
Provide kernel configuration options for module signing. The following configuration options are added: CONFIG_MODULE_SIG_SHA1 CONFIG_MODULE_SIG_SHA224 CONFIG_MODULE_SIG_SHA256 CONFIG_MODULE_SIG_SHA384 CONFIG_MODULE_SIG_SHA512 These select the cryptographic hash used to digest the data prior to signing. Additionally, the crypto module selected will be built into the kernel as it won't be possible to load it as a module without incurring a circular dependency when the kernel tries to check its signature. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-10MODSIGN: Provide gitignore and make clean rules for extra filesDavid Howells
Provide gitignore and make clean rules for extra files to hide and clean up the extra files produced by module signing stuff once it is added. Also add a clean up rule for the module content extractor program used to extract the data to be signed. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-10MODSIGN: Add FIPS policyDavid Howells
If we're in FIPS mode, we should panic if we fail to verify the signature on a module or we're asked to load an unsigned module in signature enforcing mode. Possibly FIPS mode should automatically enable enforcing mode. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-10module: signature checking hookRusty Russell
We do a very simple search for a particular string appended to the module (which is cache-hot and about to be SHA'd anyway). There's both a config option and a boot parameter which control whether we accept or fail with unsigned modules and modules that are signed with an unknown key. If module signing is enabled, the kernel will be tainted if a module is loaded that is unsigned or has a signature for which we don't have the key. (Useful feedback and tweaks by David Howells <dhowells@redhat.com>) Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08X.509: Add a crypto key parser for binary (DER) X.509 certificatesDavid Howells
Add a crypto key parser for binary (DER) encoded X.509 certificates. The certificate is parsed and, if possible, the signature is verified. An X.509 key can be added like this: # keyctl padd crypto bar @s </tmp/x509.cert 15768135 and displayed like this: # cat /proc/keys 00f09a47 I--Q--- 1 perm 39390000 0 0 asymmetri bar: X509.RSA e9fd6d08 [] Note that this only works with binary certificates. PEM encoded certificates are ignored by the parser. Note also that the X.509 key ID is not congruent with the PGP key ID, but for the moment, they will match. If a NULL or "" name is given to add_key(), then the parser will generate a key description from the CertificateSerialNumber and Name fields of the TBSCertificate: 00aefc4e I--Q--- 1 perm 39390000 0 0 asymmetri bfbc0cd76d050ea4:/C=GB/L=Cambridge/O=Red Hat/CN=kernel key: X509.RSA 0c688c7b [] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08MPILIB: Provide a function to read raw data into an MPIDavid Howells
Provide a function to read raw data of a predetermined size into an MPI rather than expecting the size to be encoded within the data. The data is assumed to represent an unsigned integer, and the resulting MPI will be positive. The function looks like this: MPI mpi_read_raw_data(const void *, size_t); This is useful for reading ASN.1 integer primitives where the length is encoded in the ASN.1 metadata. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08X.509: Add an ASN.1 decoderDavid Howells
Add an ASN.1 BER/DER/CER decoder. This uses the bytecode from the ASN.1 compiler in the previous patch to inform it as to what to expect to find in the encoded byte stream. The output from the compiler also tells it what functions to call on what tags, thus allowing the caller to retrieve information. The decoder is called as follows: int asn1_decoder(const struct asn1_decoder *decoder, void *context, const unsigned char *data, size_t datalen); The decoder argument points to the bytecode from the ASN.1 compiler. context is the caller's context and is passed to the action functions. data and datalen define the byte stream to be decoded. Note that the decoder is currently limited to datalen being less than 64K. This reduces the amount of stack space used by the decoder because ASN.1 is a nested construct. Similarly, the decoder is limited to a maximum of 10 levels of constructed data outside of a leaf node also in an effort to keep stack usage down. These restrictions can be raised if necessary. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08X.509: Add simple ASN.1 grammar compilerDavid Howells
Add a simple ASN.1 grammar compiler. This produces a bytecode output that can be fed to a decoder to inform the decoder how to interpret the ASN.1 stream it is trying to parse. Action functions can be specified in the grammar by interpolating: ({ foo }) after a type, for example: SubjectPublicKeyInfo ::= SEQUENCE { algorithm AlgorithmIdentifier, subjectPublicKey BIT STRING ({ do_key_data }) } The decoder is expected to call these after matching this type and parsing the contents if it is a constructed type. The grammar compiler does not currently support the SET type (though it does support SET OF) as I can't see a good way of tracking which members have been encountered yet without using up extra stack space. Currently, the grammar compiler will fail if more than 256 bytes of bytecode would be produced or more than 256 actions have been specified as it uses 8-bit jump values and action indices to keep space usage down. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08X.509: Add utility functions to render OIDs as stringsDavid Howells
Add a pair of utility functions to render OIDs as strings. The first takes an encoded OID and turns it into a "a.b.c.d" form string: int sprint_oid(const void *data, size_t datasize, char *buffer, size_t bufsize); The second takes an OID enum index and calls the first on the data held therein: int sprint_OID(enum OID oid, char *buffer, size_t bufsize); Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08X.509: Implement simple static OID registryDavid Howells
Implement a simple static OID registry that allows the mapping of an encoded OID to an enum value for ease of use. The OID registry index enum appears in the: linux/oid_registry.h header file. A script generates the registry from lines in the header file that look like: <sp*>OID_foo,<sp*>/*<sp*>1.2.3.4<sp*>*/ The actual OID is taken to be represented by the numbers with interpolated dots in the comment. All other lines in the header are ignored. The registry is queries by calling: OID look_up_oid(const void *data, size_t datasize); This returns a number from the registry enum representing the OID if found or OID__NR if not. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08RSA: Fix signature verification for shorter signaturesDavid Howells
gpg can produce a signature file where length of signature is less than the modulus size because the amount of space an MPI takes up is kept as low as possible by discarding leading zeros. This regularly happens for several modules during the build. Fix it by relaxing check in RSA verification code. Thanks to Tomas Mraz and Miloslav Trmac for help. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08RSA: Implement signature verification algorithm [PKCS#1 / RFC3447]David Howells
Implement RSA public key cryptography [PKCS#1 / RFC3447]. At this time, only the signature verification algorithm is supported. This uses the asymmetric public key subtype to hold its key data. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08MPILIB: Reinstate mpi_cmp[_ui]() and export for RSA signature verificationDavid Howells
Reinstate and export mpi_cmp() and mpi_cmp_ui() from the MPI library for use by RSA signature verification as per RFC3447 section 5.2.2 step 1. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08KEYS: Provide signature verification with an asymmetric keyDavid Howells
Provide signature verification using an asymmetric-type key to indicate the public key to be used. The API is a single function that can be found in crypto/public_key.h: int verify_signature(const struct key *key, const struct public_key_signature *sig) The first argument is the appropriate key to be used and the second argument is the parsed signature data: struct public_key_signature { u8 *digest; u16 digest_size; enum pkey_hash_algo pkey_hash_algo : 8; union { MPI mpi[2]; struct { MPI s; /* m^d mod n */ } rsa; struct { MPI r; MPI s; } dsa; }; }; This should be filled in prior to calling the function. The hash algorithm should already have been called and the hash finalised and the output should be in a buffer pointed to by the 'digest' member. Any extra data to be added to the hash by the hash format (eg. PGP) should have been added by the caller prior to finalising the hash. It is assumed that the signature is made up of a number of MPI values. If an algorithm becomes available for which this is not the case, the above structure will have to change. It is also assumed that it will have been checked that the signature algorithm matches the key algorithm. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08KEYS: Asymmetric public-key algorithm crypto key subtypeDavid Howells
Add a subtype for supporting asymmetric public-key encryption algorithms such as DSA (FIPS-186) and RSA (PKCS#1 / RFC1337). Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08KEYS: Asymmetric key pluggable data parsersDavid Howells
The instantiation data passed to the asymmetric key type are expected to be formatted in some way, and there are several possible standard ways to format the data. The two obvious standards are OpenPGP keys and X.509 certificates. The latter is especially useful when dealing with UEFI, and the former might be useful when dealing with, say, eCryptfs. Further, it might be desirable to provide formatted blobs that indicate hardware is to be accessed to retrieve the keys or that the keys live unretrievably in a hardware store, but that the keys can be used by means of the hardware. From userspace, the keys can be loaded using the keyctl command, for example, an X.509 binary certificate: keyctl padd asymmetric foo @s <dhowells.pem or a PGP key: keyctl padd asymmetric bar @s <dhowells.pub or a pointer into the contents of the TPM: keyctl add asymmetric zebra "TPM:04982390582905f8" @s Inside the kernel, pluggable parsers register themselves and then get to examine the payload data to see if they can handle it. If they can, they get to: (1) Propose a name for the key, to be used it the name is "" or NULL. (2) Specify the key subtype. (3) Provide the data for the subtype. The key type asks the parser to do its stuff before a key is allocated and thus before the name is set. If successful, the parser stores the suggested data into the key_preparsed_payload struct, which will be either used (if the key is successfully created and instantiated or updated) or discarded. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08KEYS: Implement asymmetric key typeDavid Howells
Create a key type that can be used to represent an asymmetric key type for use in appropriate cryptographic operations, such as encryption, decryption, signature generation and signature verification. The key type is "asymmetric" and can provide access to a variety of cryptographic algorithms. Possibly, this would be better as "public_key" - but that has the disadvantage that "public key" is an overloaded term. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08KEYS: Document asymmetric key typeDavid Howells
In-source documentation for the asymmetric key type. This will be located in: Documentation/crypto/asymmetric-keys.txt Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08MPILIB: Provide count_leading/trailing_zeros() based on arch functionsDavid Howells
Provide count_leading/trailing_zeros() macros based on extant arch bit scanning functions rather than reimplementing from scratch in MPILIB. Whilst we're at it, turn count_foo_zeros(n, x) into n = count_foo_zeros(x). Also move the definition to asm-generic as other people may be interested in using it. Signed-off-by: David Howells <dhowells@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitry Kasatkin <dmitry.kasatkin@intel.com> Cc: Arnd Bergmann <arnd@arndb.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08KEYS: Add payload preparsing opportunity prior to key instantiate or updateDavid Howells
Give the key type the opportunity to preparse the payload prior to the instantiation and update routines being called. This is done with the provision of two new key type operations: int (*preparse)(struct key_preparsed_payload *prep); void (*free_preparse)(struct key_preparsed_payload *prep); If the first operation is present, then it is called before key creation (in the add/update case) or before the key semaphore is taken (in the update and instantiate cases). The second operation is called to clean up if the first was called. preparse() is given the opportunity to fill in the following structure: struct key_preparsed_payload { char *description; void *type_data[2]; void *payload; const void *data; size_t datalen; size_t quotalen; }; Before the preparser is called, the first three fields will have been cleared, the payload pointer and size will be stored in data and datalen and the default quota size from the key_type struct will be stored into quotalen. The preparser may parse the payload in any way it likes and may store data in the type_data[] and payload fields for use by the instantiate() and update() ops. The preparser may also propose a description for the key by attaching it as a string to the description field. This can be used by passing a NULL or "" description to the add_key() system call or the key_create_or_update() function. This cannot work with request_key() as that required the description to tell the upcall about the key to be created. This, for example permits keys that store PGP public keys to generate their own name from the user ID and public key fingerprint in the key. The instantiate() and update() operations are then modified to look like this: int (*instantiate)(struct key *key, struct key_preparsed_payload *prep); int (*update)(struct key *key, struct key_preparsed_payload *prep); and the new payload data is passed in *prep, whether or not it was preparsed. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28module: wait when loading a module which is currently initializing.Rusty Russell
The original module-init-tools module loader used a fnctl lock on the .ko file to avoid attempts to simultaneously load a module. Unfortunately, you can't get an exclusive fcntl lock on a read-only fd, making this not work for read-only mounted filesystems. module-init-tools has a hacky sleep-and-loop for this now. It's not that hard to wait in the kernel, and only return -EEXIST once the first module has finished loading (or continue loading the module if the first one failed to initialize for some reason). It's also consistent with what we do for dependent modules which are still loading. Suggested-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28module: fix symbol waiting when module fails before initRusty Russell
We use resolve_symbol_wait(), which blocks if the module containing the symbol is still loading. However: 1) The module_wq we use is only woken after calling the modules' init function, but there are other failure paths after the module is placed in the linked list where we need to do the same thing. 2) wake_up() only wakes one waiter, and our waitqueue is shared by all modules, so we need to wake them all. 3) wake_up_all() doesn't imply a memory barrier: I feel happier calling it after we've grabbed and dropped the module_mutex, not just after the state assignment. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28Make most arch asm/module.h files use asm-generic/module.hDavid Howells
Use the mapping of Elf_[SPE]hdr, Elf_Addr, Elf_Sym, Elf_Dyn, Elf_Rel/Rela, ELF_R_TYPE() and ELF_R_SYM() to either the 32-bit version or the 64-bit version into asm-generic/module.h for all arches bar MIPS. Also, use the generic definition mod_arch_specific where possible. To this end, I've defined three new config bools: (*) HAVE_MOD_ARCH_SPECIFIC Arches define this if they don't want to use the empty generic mod_arch_specific struct. (*) MODULES_USE_ELF_RELA Arches define this if their modules can contain RELA records. This causes the Elf_Rela mapping to be emitted and allows apply_relocate_add() to be defined by the arch rather than have the core emit an error message. (*) MODULES_USE_ELF_REL Arches define this if their modules can contain REL records. This causes the Elf_Rel mapping to be emitted and allows apply_relocate() to be defined by the arch rather than have the core emit an error message. Note that it is possible to allow both REL and RELA records: m68k and mips are two arches that do this. With this, some arch asm/module.h files can be deleted entirely and replaced with a generic-y marker in the arch Kbuild file. Additionally, I have removed the bits from m32r and score that handle the unsupported type of relocation record as that's now handled centrally. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28MIPS: Fix module.c build for 32 bitRalf Baechle
Fixes build failure introduced by "Make most arch asm/module.h files use asm-generic/module.h" by moving all the RELA processing code to a separate file to be used only for RELA processing on 64-bit kernels. CC arch/mips/kernel/module.o arch/mips/kernel/module.c:250:14: error: 'reloc_handlers_rela' defined but not used [-Werror=unused-variable] cc1: all warnings being treated as errors make[6]: *** [arch/mips/kernel/module.o] Error 1 Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28module: taint kernel when lve module is loadedMatthew Garrett
Cloudlinux have a product called lve that includes a kernel module. This was previously GPLed but is now under a proprietary license, but the module continues to declare MODULE_LICENSE("GPL") and makes use of some EXPORT_SYMBOL_GPL symbols. Forcibly taint it in order to avoid this. Signed-off-by: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Alex Lyashkov <umka@cloudlinux.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: stable@kernel.org
2012-09-18Merge tag 'hwspinlock-3.6-fix' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ohad/hwspinlock Pull hwspinlock fix from Ohad Ben-Cohen: "A single hwspinlock fix by Wei Yongjun, which prevents potential NULL dereferences" * tag 'hwspinlock-3.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/ohad/hwspinlock: hwspinlock/core: move the dereference below the NULL test
2012-09-18vfs: dcache: use DCACHE_DENTRY_KILLED instead of DCACHE_DISCONNECTED in d_kill()Miklos Szeredi
IBM reported a soft lockup after applying the fix for the rename_lock deadlock. Commit c83ce989cb5f ("VFS: Fix the nfs sillyrename regression in kernel 2.6.38") was found to be the culprit. The nfs sillyrename fix used DCACHE_DISCONNECTED to indicate that the dentry was killed. This flag can be set on non-killed dentries too, which results in infinite retries when trying to traverse the dentry tree. This patch introduces a separate flag: DCACHE_DENTRY_KILLED, which is only set in d_kill() and makes try_to_ascend() test only this flag. IBM reported successful test results with this patch. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17Merge branch 'for-3.6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull another workqueue fix from Tejun Heo: "Unfortunately, yet another late fix. This too is discovered and fixed by Lai. This bug was introduced during this merge window by commit 25511a477657 ("workqueue: reimplement CPU online rebinding to handle idle workers") which started using WORKER_REBIND flag for idle rebind too. The bug is relatively easy to trigger if the CPU rapidly goes through off, on and then off (and stay off). The fix is on the safer side. This hasn't been on linux-next yet but I'm pushing early so that it can get more exposure before v3.6 release." * 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn()
2012-09-17workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn()Lai Jiangshan
busy_worker_rebind_fn() didn't clear WORKER_REBIND if rebinding failed (CPU is down again). This used to be okay because the flag wasn't used for anything else. However, after 25511a477 "workqueue: reimplement CPU online rebinding to handle idle workers", WORKER_REBIND is also used to command idle workers to rebind. If not cleared, the worker may confuse the next CPU_UP cycle by having REBIND spuriously set or oops / get stuck by prematurely calling idle_worker_rebind(). WARNING: at /work/os/wq/kernel/workqueue.c:1323 worker_thread+0x4cd/0x5 00() Hardware name: Bochs Modules linked in: test_wq(O-) Pid: 33, comm: kworker/1:1 Tainted: G O 3.6.0-rc1-work+ #3 Call Trace: [<ffffffff8109039f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff810903fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff810b3f1d>] worker_thread+0x4cd/0x500 [<ffffffff810bc16e>] kthread+0xbe/0xd0 [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10 ---[ end trace e977cf20f4661968 ]--- BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff810b3db0>] worker_thread+0x360/0x500 PGD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: test_wq(O-) CPU 0 Pid: 33, comm: kworker/1:1 Tainted: G W O 3.6.0-rc1-work+ #3 Bochs Bochs RIP: 0010:[<ffffffff810b3db0>] [<ffffffff810b3db0>] worker_thread+0x360/0x500 RSP: 0018:ffff88001e1c9de0 EFLAGS: 00010086 RAX: 0000000000000000 RBX: ffff88001e633e00 RCX: 0000000000004140 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009 RBP: ffff88001e1c9ea0 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000002 R11: 0000000000000000 R12: ffff88001fc8d580 R13: ffff88001fc8d590 R14: ffff88001e633e20 R15: ffff88001e1c6900 FS: 0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000130e8000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kworker/1:1 (pid: 33, threadinfo ffff88001e1c8000, task ffff88001e1c6900) Stack: ffff880000000000 ffff88001e1c9e40 0000000000000001 ffff88001e1c8010 ffff88001e519c78 ffff88001e1c9e58 ffff88001e1c6900 ffff88001e1c6900 ffff88001e1c6900 ffff88001e1c6900 ffff88001fc8d340 ffff88001fc8d340 Call Trace: [<ffffffff810bc16e>] kthread+0xbe/0xd0 [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10 Code: b1 00 f6 43 48 02 0f 85 91 01 00 00 48 8b 43 38 48 89 df 48 8b 00 48 89 45 90 e8 ac f0 ff ff 3c 01 0f 85 60 01 00 00 48 8b 53 50 <8b> 02 83 e8 01 85 c0 89 02 0f 84 3b 01 00 00 48 8b 43 38 48 8b RIP [<ffffffff810b3db0>] worker_thread+0x360/0x500 RSP <ffff88001e1c9de0> CR2: 0000000000000000 There was no reason to keep WORKER_REBIND on failure in the first place - WORKER_UNBOUND is guaranteed to be set in such cases preventing incorrectly activating concurrency management. Always clear WORKER_REBIND. tj: Updated comment and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-09-17Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds
Merge fixes from Andrew Morton: "13 patches. 12 are fixes and one is a little preparatory thing for Andi." * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (13 commits) memory hotplug: fix section info double registration bug mm/page_alloc: fix the page address of higher page's buddy calculation drivers/rtc/rtc-twl.c: ensure all interrupts are disabled during probe compiler.h: add __visible pid-namespace: limit value of ns_last_pid to (0, max_pid) include/net/sock.h: squelch compiler warning in sk_rmem_schedule() slub: consider pfmemalloc_match() in get_partial_node() slab: fix starting index for finding another object slab: do ClearSlabPfmemalloc() for all pages of slab nbd: clear waiting_queue on shutdown MAINTAINERS: fix TXT maintainer list and source repo path mm/ia64: fix a memory block size bug memory hotplug: reset pgdat->kswapd to NULL if creating kernel thread fails
2012-09-17memory hotplug: fix section info double registration bugqiuxishi
There may be a bug when registering section info. For example, on my Itanium platform, the pfn range of node0 includes the other nodes, so other nodes' section info will be double registered, and memmap's page count will equal to 3. node0: start_pfn=0x100, spanned_pfn=0x20fb00, present_pfn=0x7f8a3, => 0x000100-0x20fc00 node1: start_pfn=0x80000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x080000-0x100000 node2: start_pfn=0x100000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x100000-0x180000 node3: start_pfn=0x180000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x180000-0x200000 free_all_bootmem_node() register_page_bootmem_info_node() register_page_bootmem_info_section() When hot remove memory, we can't free the memmap's page because page_count() is 2 after put_page_bootmem(). sparse_remove_one_section() free_section_usemap() free_map_bootmem() put_page_bootmem() [akpm@linux-foundation.org: add code comment] Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17mm/page_alloc: fix the page address of higher page's buddy calculationLi Haifeng
The heuristic method for buddy has been introduced since commit 43506fad21ca ("mm/page_alloc.c: simplify calculation of combined index of adjacent buddy lists"). But the page address of higher page's buddy was wrongly calculated, which will lead page_is_buddy to fail for ever. IOW, the heuristic method would be disabled with the wrong page address of higher page's buddy. Calculating the page address of higher page's buddy should be based higher_page with the offset between index of higher page and index of higher page's buddy. Signed-off-by: Haifeng Li <omycle@gmail.com> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KyongHo Cho <pullip.cho@samsung.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: <stable@vger.kernel.org> [2.6.38+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17drivers/rtc/rtc-twl.c: ensure all interrupts are disabled during probeKevin Hilman
On some platforms, bootloaders are known to do some interesting RTC programming. Without going into the obscurities as to why this may be the case, suffice it to say the the driver should not make any assumptions about the state of the RTC when the driver loads. In particular, the driver probe should be sure that all interrupts are disabled until otherwise programmed. This was discovered when finding bursty I2C traffic every second on Overo platforms. This I2C overhead was keeping the SoC from hitting deep power states. The cause was found to be the RTC firing every second on the I2C-connected TWL PMIC. Special thanks to Felipe Balbi for suggesting to look for a rogue driver as the source of the I2C traffic rather than the I2C driver itself. Special thanks to Steve Sakoman for helping track down the source of the continuous RTC interrups on the Overo boards. Signed-off-by: Kevin Hilman <khilman@ti.com> Cc: Felipe Balbi <balbi@ti.com> Tested-by: Steve Sakoman <steve@sakoman.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Tested-by: Shubhrajyoti Datta <omaplinuxkernel@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17compiler.h: add __visibleAndi Kleen
gcc 4.6+ has support for a externally_visible attribute that prevents the optimizer from optimizing unused symbols away. Add a __visible macro to use it with that compiler version or later. This is used (at least) by the "Link Time Optimization" patchset. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17pid-namespace: limit value of ns_last_pid to (0, max_pid)Andrew Vagin
The kernel doesn't check the pid for negative values, so if you try to write -2 to /proc/sys/kernel/ns_last_pid, you will get a kernel panic. The crash happens because the next pid is -1, and alloc_pidmap() will try to access to a nonexistent pidmap. map = &pid_ns->pidmap[pid/BITS_PER_PAGE]; Signed-off-by: Andrew Vagin <avagin@openvz.org> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17include/net/sock.h: squelch compiler warning in sk_rmem_schedule()Chuck Lever
This warning: In file included from linux/include/linux/tcp.h:227:0, from linux/include/linux/ipv6.h:221, from linux/include/net/ipv6.h:16, from linux/include/linux/sunrpc/clnt.h:26, from linux/net/sunrpc/stats.c:22: linux/include/net/sock.h: In function `sk_rmem_schedule': linux/nfs-2.6/include/net/sock.h:1339:13: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] is seen with gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2) using the -Wextra option. Commit c76562b6709f ("netvm: prevent a stream-specific deadlock") accidentally replaced the "size" parameter of sk_rmem_schedule() with an unsigned int. This changes the semantics of the comparison in the return statement. In sk_wmem_schedule we have syntactically the same comparison, but "size" is a signed integer. In addition, __sk_mem_schedule() takes a signed integer for its "size" parameter, so there is an implicit type conversion in sk_rmem_schedule() anyway. Revert the "size" parameter back to a signed integer so that the semantics of the expressions in both sk_[rw]mem_schedule() are exactly the same. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Joonsoo Kim <js1304@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17slub: consider pfmemalloc_match() in get_partial_node()Joonsoo Kim
get_partial() is currently not checking pfmemalloc_match() meaning that it is possible for pfmemalloc pages to leak to non-pfmemalloc users. This is a problem in the following situation. Assume that there is a request from normal allocation and there are no objects in the per-cpu cache and no node-partial slab. In this case, slab_alloc enters the slow path and new_slab_objects() is called which may return a PFMEMALLOC page. As the current user is not allowed to access PFMEMALLOC page, deactivate_slab() is called ([5091b74a: mm: slub: optimise the SLUB fast path to avoid pfmemalloc checks]) and returns an object from PFMEMALLOC page. Next time, when we get another request from normal allocation, slab_alloc() enters the slow-path and calls new_slab_objects(). In new_slab_objects(), we call get_partial() and get a partial slab which was just deactivated but is a pfmemalloc page. We extract one object from it and re-deactivate. "deactivate -> re-get in get_partial -> re-deactivate" occures repeatedly. As a result, access to PFMEMALLOC page is not properly restricted and it can cause a performance degradation due to frequent deactivation. deactivation frequently. This patch changes get_partial_node() to take pfmemalloc_match() into account and prevents the "deactivate -> re-get in get_partial() scenario. Instead, new_slab() is called. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17slab: fix starting index for finding another objectJoonsoo Kim
In array cache, there is a object at index 0, check it. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17slab: do ClearSlabPfmemalloc() for all pages of slabMel Gorman
Right now, we call ClearSlabPfmemalloc() for first page of slab when we clear SlabPfmemalloc flag. This is fine for most swap-over-network use cases as it is expected that order-0 pages are in use. Unfortunately it is possible that that __ac_put_obj() checks SlabPfmemalloc on a tail page and while this is harmless, it is sloppy. This patch ensures that the head page is always used. This problem was originally identified by Joonsoo Kim. [js1304@gmail.com: Original implementation and problem identification] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17nbd: clear waiting_queue on shutdownPaul Clements
Fix a serious but uncommon bug in nbd which occurs when there is heavy I/O going to the nbd device while, at the same time, a failure (server, network) or manual disconnect of the nbd connection occurs. There is a small window between the time that the nbd_thread is stopped and the socket is shutdown where requests can continue to be queued to nbd's internal waiting_queue. When this happens, those requests are never completed or freed. The fix is to clear the waiting_queue on shutdown of the nbd device, in the same way that the nbd request queue (queue_head) is already being cleared. Signed-off-by: Paul Clements <paul.clements@steeleye.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17MAINTAINERS: fix TXT maintainer list and source repo pathGang Wei
Signed-off-by: Gang Wei <gang.wei@intel.com> Cc: Richard L Maliszewski <richard.l.maliszewski@intel.com> Cc: Gang Wei <gang.wei@intel.com> Cc: Shane Wang <shane.wang@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17mm/ia64: fix a memory block size bugJianguo Wu
I found following definition in include/linux/memory.h, in my IA64 platform, SECTION_SIZE_BITS is equal to 32, and MIN_MEMORY_BLOCK_SIZE will be 0. #define MIN_MEMORY_BLOCK_SIZE (1 << SECTION_SIZE_BITS) Because MIN_MEMORY_BLOCK_SIZE is int type and length of 32bits, so MIN_MEMORY_BLOCK_SIZE(1 << 32) will will equal to 0. Actually when SECTION_SIZE_BITS >= 31, MIN_MEMORY_BLOCK_SIZE will be wrong. This will cause wrong system memory infomation in sysfs. I think it should be: #define MIN_MEMORY_BLOCK_SIZE (1UL << SECTION_SIZE_BITS) And "echo offline > memory0/state" will cause following call trace: kernel BUG at mm/memory_hotplug.c:885! sh[6455]: bugcheck! 0 [1] Pid: 6455, CPU 0, comm: sh psr : 0000101008526030 ifs : 8000000000000fa4 ip : [<a0000001008c40f0>] Not tainted (3.6.0-rc1) ip is at offline_pages+0x210/0xee0 Call Trace: show_stack+0x80/0xa0 show_regs+0x640/0x920 die+0x190/0x2c0 die_if_kernel+0x50/0x80 ia64_bad_break+0x3d0/0x6e0 ia64_native_leave_kernel+0x0/0x270 offline_pages+0x210/0xee0 alloc_pages_current+0x180/0x2a0 Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: "Luck, Tony" <tony.luck@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17memory hotplug: reset pgdat->kswapd to NULL if creating kernel thread failsWen Congyang
If kthread_run() fails, pgdat->kswapd contains errno. When we stop this thread, we only check whether pgdat->kswapd is NULL and access it. If it contains errno, it will cause page fault. Reset pgdat->kswapd to NULL when creating kernel thread fails can avoid this problem. Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17Merge tag 'rdma-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband Pull InfiniBand/RDMA fixes from Roland Dreier: - A couple more IPoIB fixes for regressions introduced by path database conversion - Minor other fixes to low-level drivers (cxgb4, mlx4, qib, ocrdma) * tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: IB/qib: Fix failure of compliance test C14-024#06_LocalPortNum RDMA/ocrdma: Fix CQE expansion of unsignaled WQE mlx4_core: Fix integer overflows so 8TBs of memory registration works IPoIB: Fix AB-BA deadlock when deleting neighbours IPoIB: Fix memory leak in the neigh table deletion flow RDMA/cxgb4: Move dereference below NULL test
2012-09-17fs/proc: fix potential unregister_sysctl_table hangFrancesco Ruggeri
The unregister_sysctl_table() function hangs if all references to its ctl_table_header structure are not dropped. This can happen sometimes because of a leak in proc_sys_lookup(): proc_sys_lookup() gets a reference to the table via lookup_entry(), but it does not release it when a subsequent call to sysctl_follow_link() fails. This patch fixes this leak by making sure the reference is always dropped on return. See also commit 076c3eed2c31 ("sysctl: Rewrite proc_sys_lookup introducing find_entry and lookup_entry") which reorganized this code in 3.4. Tested in Linux 3.4.4. Signed-off-by: Francesco Ruggeri <fruggeri@aristanetworks.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-16Linux 3.6-rc6Linus Torvalds
2012-09-16Merge tag 'mfd-for-linus-3.6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6 Pull mfd fixes from Samuel Ortiz: "This is the remaining MFD fixes for 3.6, with 5 pending fixes: - A tps65217 build error fix. - A lcp_ich regression fix caused by the MFD driver failing to initialize the watchdog sub device due to ACPI conflicts. - 2 MAX77693 interrupt handling bug fixes. - An MFD core fix, adding an IRQ domain argument to the MFD device addition API in order to prevent silent and potentially harmful remapping behaviour changes for drivers supporting non-DT platforms." * tag 'mfd-for-linus-3.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6: mfd: MAX77693: Fix NULL pointer error when initializing irqs mfd: MAX77693: Fix interrupt handling bug mfd: core: Push irqdomain mapping out into devices mfd: lpc_ich: Fix a 3.5 kernel regression for iTCO_wdt driver mfd: Move tps65217 regulator plat data handling to regulator