Total found: 707. Displayed: 188.
28-10-2010 дата публикации

CHTUNG UND COMPUTERSYSTEM UM EINE SCHLEIFE IN EINEM PROGRAMM ZU COMPILIEREN

Номер: DE602005023651D1
Принадлежит: NXP BV, NXP B.V.

Подробнее
15-10-2010 дата публикации

COMPOSITION COMPILATION, COMPOSITION DEVICE AND COMPUTER SYSTEM AROUND A LOOP IN A PROGRAM FOR COMPILING

Номер: AT0000481678T
Принадлежит:

Подробнее
15-08-1995 дата публикации

RELAY PROCEDURE FOR THE EXECUTION OF INTERLOCKED LOOPS IN MULTIPROCESSOR COMPUTERS.

Номер: AT0000125966T
Принадлежит:

Подробнее
20-09-2007 дата публикации

GENERAL PURPOSE SOFTWARE PARALLEL TASK ENGINE

Номер: CA0002707680A1
Принадлежит:

A software engine for decomposing work to be done into tasks, and distributing the tasks to multiple, independent CPUs for execution is described. The engine utilizes dynamic code generation, with run-time specialization of variables, to achieve high performance. Problems are decomposed according to methods that enhance parallel CPU operation, and provide better opportunities for specialization and optimization of dynamically generated code. A specific application of this engine, a software three dimensional (3D) graphical image renderer, is described.

Подробнее
06-07-2018 дата публикации

Parallel program generating method and parallelizing compiler apparatus

Номер: CN0108255492A
Автор:
Принадлежит:

Подробнее
18-11-2015 дата публикации

Hardware and software solutions to divergent branches in a parallel pipeline

Номер: CN0105074657A
Автор: YAZDANI REZA
Принадлежит:

Подробнее
28-08-2015 дата публикации

METHOD FOR SECURING A PROGRAM

Номер: SG11201505428TA
Принадлежит:

Подробнее
01-11-1990 дата публикации

PROCESS FOR DETERMINING CYCLES REPLACEABLE BY VECTOR COMMANDS OF A VECTOR COMPUTER INSIDE COMMAND LOOPS OF PROGRAMS WRITTEN FOR VON NEUMANN COMPUTERS

Номер: WO1990013081A1
Автор: RÖSSIG, Stephan
Принадлежит:

When programs written for Von Neumann computers are to be used on supercomputers, e.g. vector computers, the operations in the program which can be executed in parallel must be executed in parallel in the vector computers in order to exploit the capacity of the vector computers. These situations can then be introduced into the program if it contains program loops. A program loop which contains no cycles can be vectorized, i.e., executed by vector commands. If a program loop contains cycles, the vector computer must contain special commands for executing these cycles in parallel. A special command of this type can, for example, be a command for forming the vector sum or for forming an arithmetic series. The command loops are inspected for the presence of cycles which can be replaced by these special commands. To this end, the control flow and data flow determined by the commands in the loop are inspected and control dependencies and data dependencies between the scalar presence of variables ...

Подробнее
30-07-1991 дата публикации

Horizontal computer having register multiconnect for execution of a loop with overlapped code

Номер: US0005036454A1
Принадлежит: Hewlett-Packard Company

A horizontal computer for execution of an instruction loop with overlapped code. The computer includes a plurality of processors, a multiconnect unit for storing operands for the processors, an instruction unit for specifying address offsets and operations to be performed by the processors, and an invariant address unit for combining the address offsets with a modifiable pointer to form source and destination addresses in the multiconnect unit. The instruction unit enables different ones of the processors as a function of which iteration of the loop is being executed, for example by means of processor control circuitry or by selectively providing instructions to the processors, so that different operations are performed during different iterations.

Подробнее
30-08-2011 дата публикации

Parallel programming interface to dynamically allocate program portions

Номер: US0008010954B2

A computing device-implemented method includes receiving a program created by a technical computing environment, analyzing the program, generating multiple program portions based on the analysis of the program, dynamically allocating the multiple program portions to multiple software units of execution for parallel programming, receiving multiple results associated with the multiple program portions from the multiple software units of execution, and providing the multiple results or a single result to the program.

Подробнее
20-08-2013 дата публикации

Multiversioning if statement merging and loop fusion

Номер: US0008516468B2

In one embodiment of the invention, a method fuses a first loop nested in a first IF statement with a second loop nested in a second IF statement without the use of modified and referenced (mod-ref) information to determine whether certain conditional statements in the IF statements retain variable values.
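
The fusion described above can be pictured at the source level. The following C sketch is only an illustration of multiversioning with a fused loop and an unfused fallback; the function names, the guarded loops and the fusion condition are invented for the example and are not taken from the patent.

#include <stddef.h>

/* Original form: two loops, each nested in its own IF statement. */
void original(double *a, double *b, const double *c, size_t n, int p, int q)
{
    if (p) {
        for (size_t i = 0; i < n; i++)
            a[i] = c[i] * 2.0;
    }
    if (q) {
        for (size_t i = 0; i < n; i++)
            b[i] = c[i] + 1.0;
    }
}

/* Multiversioned form: when both guards hold, run a single fused loop;
 * otherwise fall back to the original structure. */
void multiversioned(double *a, double *b, const double *c, size_t n, int p, int q)
{
    if (p && q) {
        for (size_t i = 0; i < n; i++) {   /* fused version */
            a[i] = c[i] * 2.0;
            b[i] = c[i] + 1.0;
        }
    } else {
        original(a, b, c, n, p, q);        /* unfused fallback version */
    }
}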

Подробнее
06-03-2018 дата публикации

Method and apparatus for approximating detection of overlaps between memory ranges

Номер: US0009910650B2
Принадлежит: Intel Corporation, INTEL CORP

A computer-implemented method for managing loop code in a compiler includes using a conflict detection procedure that detects across-iteration dependency for arrays of single memory addresses to determine whether a potential across-iteration dependency exists for arrays of memory addresses for ranges of memory accessed by the loop code.

Подробнее
22-06-2023 дата публикации

APPARATUS AND METHOD WITH NEURAL NETWORK COMPUTATION SCHEDULING

Номер: US20230195439A1
Автор: Bernhard EGGER, Hyemi MIN

An apparatus includes a processor configured to generate each of intermediate representation codes corresponding to each of a plurality of loop structures obtained that corresponds to a neural network computation based on an input specification file of hardware; schedule instructions included in each of the intermediate representation codes corresponding to the plurality of loop structures; select, based on latency values predicted according to scheduling results of the intermediate representation codes, any one code among the intermediate representation codes; and allocate, based on a scheduling result of the selected intermediate representation code, instructions included in the selected intermediate representation code to resources of the hardware included in the apparatus.

Подробнее
11-05-2022 дата публикации

Parallel programming method

Номер: RU2771739C1

The invention relates to the field of computer engineering. The technical result is faster program execution on multi-core computing systems. A parallel programming method is disclosed in which parallelization facilities are selected automatically while programs are executed in a computing system comprising a host system and a target device connected by an interface, each of which contains a multi-core central processor, memory and an instruction cache, and which performs the following operations: the program source code is created on the host system; a report of the execution time of the program's loop regions is produced from a test run and stored in memory; the program code is analysed on the basis of the test runs, and the most time-consuming loop regions are identified and numbered; the source program code is modified by inserting into it additional markers for the beginning and end of the time-consuming loop regions and of the parallelization facilities applied to ...

Подробнее
18-08-2004 дата публикации

Scheduling of consumer and producer instructions in a processor with parallel execution units.

Номер: GB2398412A
Принадлежит: PTS Corp

A method of scheduling consumer instructions (c1 & c2) requiring a value produced by a producer instruction (p1) to execution units in a processor having a plurality of execution units comprises scheduling a consumer instruction in a loop kernel block before scheduling a producer instruction using a compiler. In operation the consumer instruction is allocated to a first execution unit before the producer instruction is allocated to a second execution unit. The selected execution unit may be the closest available execution unit to the first execution unit or in one embodiment the same execution unit. Preferably an interface block is used to create an interface between the basic and loop kernel block. Within the interface block a dummy instruction may be created by the scheduling of the consumer instruction to guide the scheduling of the producer instruction.

Подробнее
16-10-2017 дата публикации

Program loop control

Номер: TW0201737060A
Принадлежит:

A data processing system supports a predicated-loop instruction that controls vectorised execution of a program loop body in respect of a plurality of vector elements. When the number of elements to be processed is not a whole number multiple of the number of lanes of processing supported for that element size, then the predicated-loop instruction controls suppression of processing in one or more lanes not required.
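
A scalar C model of the behaviour described above may make the predication easier to see. VEC_LANES and the per-lane predicate are illustrative stand-ins for what the predicated-loop instruction does in hardware; this is not the instruction's actual encoding or semantics.

#include <stddef.h>

#define VEC_LANES 4   /* assumed number of lanes for this element size */

/* Scalar model of a predicated vector loop: each "vector iteration" handles
 * VEC_LANES elements, and a per-lane predicate suppresses the lanes that
 * fall beyond n in the final iteration. */
void scale(float *dst, const float *src, size_t n, float k)
{
    for (size_t base = 0; base < n; base += VEC_LANES) {
        for (size_t lane = 0; lane < VEC_LANES; lane++) {
            int predicate = (base + lane) < n;   /* lane active? */
            if (predicate)
                dst[base + lane] = src[base + lane] * k;
            /* inactive lanes do nothing: no loads, no stores, no faults */
        }
    }
}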

Подробнее
25-05-2021 дата публикации

Neural network operation reordering for parallel execution

Номер: US0011016775B2
Принадлежит: Amazon Technologies, Inc., AMAZON TECH INC

Techniques are disclosed for reordering operations of a neural network to improve runtime efficiency. In some examples, a compiler receives a description of the neural network comprising a plurality of operations. The compiler may determine which execution engine of a plurality of execution engines is to perform each of the plurality of operations. The compiler may determine an order of performance associated with the plurality of operations. The compiler may identify a runtime inefficiency based on the order of performance and a hardware usage for each of the plurality of operations. An operation may be reordered to reduce the runtime inefficiency. Instructions may be compiled based on the plurality of operations, which include the reordered operation.

Подробнее
04-06-2015 дата публикации

METHODS TO OPTIMIZE A PROGRAM LOOP VIA VECTOR INSTRUCTIONS USING A SHUFFLE TABLE

Номер: US20150154008A1
Принадлежит:

According to one embodiment, a code optimizer is configured to receive first code having a program loop implemented with scalar instructions to store values of a first array to a second array based on values of a third array and to generate second code representing the program loop using at least one vector instruction. The second code includes a shuffle instruction to shuffle elements of the first array based on the third array using a shuffle table in a vector manner and a store instruction to store the shuffled elements of the first array in the second array.
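
The shuffle-table idea can be modelled in portable C. In the sketch below, the table maps a 4-lane condition mask to the permutation that packs the selected elements to the front, which is the role a SIMD shuffle would play in the generated vector code; the width of 4, the table layout and all names are assumptions made for the example, not the patent's implementation.

#include <stddef.h>
#include <stdint.h>

#define W 4   /* assumed vector width */

/* Shuffle table: for each 4-bit mask of "selected" lanes, the lane indices
 * that pack the selected elements to the front.  Must be built once before
 * compress_shuffled() is used. */
static uint8_t shuffle_table[16][W];
static uint8_t popcnt4[16];

static void build_shuffle_table(void)
{
    for (unsigned mask = 0; mask < 16; mask++) {
        unsigned k = 0;
        for (unsigned lane = 0; lane < W; lane++)
            if (mask & (1u << lane))
                shuffle_table[mask][k++] = (uint8_t)lane;
        popcnt4[mask] = (uint8_t)k;
    }
}

/* Scalar reference: the kind of loop the code optimizer starts from. */
size_t compress_scalar(int *b, const int *a, const int *c, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++)
        if (c[i] != 0)
            b[out++] = a[i];
    return out;
}

/* Vector-style version: one table lookup replaces the per-element branch.
 * A real compiler would emit a SIMD shuffle here; this models it in scalar C. */
size_t compress_shuffled(int *b, const int *a, const int *c, size_t n)
{
    size_t out = 0;
    size_t i = 0;
    for (; i + W <= n; i += W) {
        unsigned mask = 0;
        for (unsigned lane = 0; lane < W; lane++)
            if (c[i + lane] != 0)
                mask |= 1u << lane;
        const uint8_t *perm = shuffle_table[mask];  /* "shuffle" control */
        for (unsigned k = 0; k < popcnt4[mask]; k++)
            b[out + k] = a[i + perm[k]];            /* packed store */
        out += popcnt4[mask];
    }
    for (; i < n; i++)                              /* remainder elements */
        if (c[i] != 0)
            b[out++] = a[i];
    return out;
}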

Подробнее
26-03-2015 дата публикации

Predicate Vector Pack and Unpack Instructions

Номер: US20150089189A1
Принадлежит: APPLE INC.

In an embodiment, a processor may implement a vector instruction set including predicate vectors and multiple vector element sizes. The vector instruction set may include predicate vector pack and unpack instructions. Responsive to the predicate vector pack instruction, the processor may pack predicates from multiple predicate vector source registers into a destination predicate vector register. Responsive to the predicate vector unpack instruction, the processor may select a portion of a source predicate vector register and write the result to a destination predicate vector register. Additionally, the predicate vector register may store one or more vector attributes associated with the corresponding vector. The processor may modify the attribute as part of the pack/unpack operation (e.g. based on a pack/unpack factor). Additionally, vector pack/unpack instructions that are controlled by the attribute in a corresponding predicate vector register may be implemented.

Подробнее
30-10-2013 дата публикации

LOOP PARALLELIZATION BASED ON LOOP SPLITTING OR INDEX ARRAY

Номер: EP2656204A2
Принадлежит:

Подробнее
16-04-2014 дата публикации

Interleaving data accesses issued in response to vector access instructions

Номер: GB0201403770D0
Автор:
Принадлежит:

Подробнее
07-02-2014 дата публикации

METHOD FOR OPTIMIZING PARALLEL PROCESSING OF DATA ON A HARDWARE PLATFORM

Номер: FR0002985824B1
Принадлежит: THALES

Подробнее
16-11-2013 дата публикации

Efficient implementation of RSA using GPU/CPU architecture

Номер: TW0201346830A
Принадлежит:

Various embodiments are directed to a heterogeneous processor architecture comprised of a CPU and a GPU on the same processor die. The heterogeneous processor architecture may optimize source code in a GPU compiler using vector strip mining to reduce instructions of arbitrary vector lengths into GPU supported vector lengths and loop peeling. It may be first determined that the source code is eligible for optimization if more than one machine code instruction of compiled source code under-utilizes GPU instruction bandwidth limitations. The initial vector strip mining results may be discarded and the first iteration of the inner loop body may be peeled out of the loop. The type of operands in the source code may be lowered and the peeled out inner loop body of source code may be vector strip mined again to obtain optimized source code.

Подробнее
27-05-2010 дата публикации

SYSTEMS, METHODS, AND APPARATUSES TO DECOMPOSE A SEQUENTIAL PROGRAM INTO MULTIPLE THREADS, EXECUTE SAID THREADS, AND RECONSTRUCT THE SEQUENTIAL EXECUTION

Номер: WO2010060084A3
Принадлежит:

Systems, methods, and apparatuses for decomposing a sequential program into multiple threads, executing these threads, and reconstructing the sequential execution of the threads are described. A plurality of data cache units (DCUs) store locally retired instructions of speculatively executed threads. A merging level cache (MLC) merges data from the lines of the DCUs. An inter-core memory coherency module (ICMC) globally retires instructions of the speculatively executed threads in the MLC.

Подробнее
08-01-2004 дата публикации

Apparatus and method for implementing adjacent, non-unit stride memory access patterns utilizing SIMD instructions

Номер: US20040006667A1
Автор: Aart Bik, Milind Girkar
Принадлежит:

An apparatus and method for implementing adjacent, single non-unit stride memory access patterns are described. In one embodiment, the method includes compiler analysis of a source program to detect vectorizable loops having serial code statements that collectively perform adjacent, non-unit stride memory access. Once a vectorizable loop containing code statements that collectively perform adjacent, non-unit stride memory access is detected, the compiler vectorizes the serial code statements of the detected loop to perform the adjacent, non-unit stride memory access utilizing SIMD instructions. As such, the compiler repeats the analysis and vectorization for each vectorizable loop within the source program code.

Подробнее
20-09-2016 дата публикации

Extracting system architecture in high level synthesis

Номер: US0009449131B2
Принадлежит: XILINX, INC., XILINX INC, Xilinx, Inc.

Extracting a system architecture in high level synthesis includes determining a first function of a high level programming language description and a second function contained within a control flow construct of the high level programming description. The second function is determined to be a data consuming function of the first function. Within a circuit design, a port including a local memory is automatically generated. The port couples a first circuit block implementation of the first function to a second circuit block implementation of the second function within the circuit design.

Подробнее
22-05-2018 дата публикации

Technologies for optimizing sparse matrix code with field-programmable gate arrays

Номер: US0009977663B2
Принадлежит: Intel Corporation, INTEL CORP

Technologies for optimizing sparse matrix code include a target computing device having a processor and a field-programmable gate array (FPGA). A compiler identifies a performance-critical loop in a sparse matrix source code and generates optimized executable code, including processor code and FPGA code. The target computing device executes the optimized executable code, using the processor for the processor code and the FPGA for the FPGA code. The processor executes a first iteration of the loop, generates reusable optimization data in response to executing the first iteration, and stores the reusable optimization data in a shared memory. The FPGA accesses the optimization data in the shared memory, executes additional iterations of the loop, and optimizes the additional iterations of the loop based on the optimization data. The optimization data may include, for example, loop-invariant data, reordered data, or alternate data storage representations. Other embodiments are described and ...

Подробнее
13-05-2014 дата публикации

Pipelined loop parallelization with pre-computations

Номер: US0008726251B2

Embodiments of the invention provide systems and methods for automatically parallelizing loops with non-speculative pipelined execution of chunks of iterations with pre-computation of selected values. Non-DOALL loops are identified and divided into chunks. The chunks are assigned to separate logical threads, which may be further assigned to hardware threads. As a thread performs its runtime computations, subsequent threads attempt to pre-compute their respective chunks of the loop. These pre-computations may result in a set of assumed initial values and pre-computed final variable values associated with each chunk. As subsequent pre-computed chunks are reached at runtime, those assumed initial values can be verified to determine whether to proceed with runtime computation of the chunk or to avoid runtime execution and instead use the pre-computed final variable values.

Подробнее
14-05-2009 дата публикации

OPTIMUM CODE GENERATION METHOD FOR MULTIPROCESSOR, AND COMPILING DEVICE

Номер: JP2009104422A
Принадлежит:

PROBLEM TO BE SOLVED: To provide a method for generating an adequate parallel code from a source code to a computer system composed of a plurality of processors which share a cache memory or a main memory. SOLUTION: The following procedures are executed: a procedure (406) which reads a preliminarily set code and analyzes an operation amount and contents of processing from the code concerned while dependence and independence are being distinguished, a procedure (407) which analyzes the quantity of data which are reused between processings, and a procedure (408) which analyzes the volume of data in accessing the main memory. Then, a procedure (409) is executed, which receives a parallel code generating plan (412) input by a user, divides the processing of the code, and finds a parallelization method which makes an execution cycle shortest while forecasting an execution cycle from the operation amount, the contents of the processing, the cache usage of the reused data, and the main memory ...

Подробнее
11-06-2014 дата публикации

Interleaving data accesses issued in response to vector access instructions

Номер: GB0002508751A
Принадлежит:

A vector data access unit for accessing data stored within a data store in response to decoded vector data access instructions is disclosed. Each of the vector data access instructions comprise a plurality of elements indicating a data access to be performed, the elements being in an order within the vector data access instruction that the corresponding data access is instructed to be performed in. The vector data access unit comprises data access ordering circuitry for issuing data access requests indicated by the elements to the data store, the data access ordering circuitry being configured in response to receipt of at least two decoded vector data access instructions, an earlier of the at least two decoded vector data access instructions being received before a later of the at least two decoded vector instructions and one of the at least two decoded vector data access instructions being a write instruction and to an indication that data accesses from the at least two decoded vector ...

Подробнее
19-12-2001 дата публикации

Predicated execution of instructions in processors

Номер: GB2363480A
Принадлежит:

A processor, operable to execute instructions on a predicated basis, includes a series of predicate registers (135), a control information holding unit (131) and a plurality of operating units (133). Each predicate register (135) is switchable between at least two states and each is assignable to one or more predicated-execution instructions. The control information holding unit (131) holds items of control information which correspond respectively to the predicate registers, and each operating unit also corresponds individually to one of the predicate registers. Each operating unit receives the control-information items corresponding to its own corresponding predicate register and a further one of the predicate registers, to determine the state of its own predicate register. In one embodiment, the operating units are operable in parallel with one another to perform respective such state determining operations. The state determining operations can be used to bring about state changes required ...

Подробнее
05-06-2002 дата публикации

Predicated execution of instructions in processors

Номер: GB0002367406B
Принадлежит: SIROYAN LTD, * SIROYAN LIMITED

Подробнее
18-08-2004 дата публикации

Scheduling of consumer and producer instructions in a processor with parallel execution units.

Номер: GB2398411A
Принадлежит:

A method of scheduling consumer instructions (c1 and c2) requiring a value produced by a producer instruction (p1) to execution units in a processor having a plurality of execution units comprises scheduling a consumer instruction in a loop kernel block before scheduling a producer instruction using a compiler. In operation the consumer instruction is allocated to a first execution unit before the producer instruction is allocated to a second execution unit. The scheduling of the producer instruction also requires the creation of a move instruction (mv) to create an availability chain such that a value is moved from a first point accessible by the basic block to a second point accessible by the loop block. The point may be a register file accessible by one of the execution units. Preferably an interface block is used to create an interface between the basic and loop kernel block. Within the interface block a dummy instruction may be created by the scheduling of the consumer instruction ...

Подробнее
15-02-2023 дата публикации

Techniques for parallel execution

Номер: GB0002609700A
Принадлежит:

Identification of instructions for advanced execution, the instructions that have been identified by a compiler to be speculatively performed in parallel. The instructions may be identified based on copy operations and the instructions may be performed in response to receiving a command from another processor. The command may be a kernel launch command from a host computer system. The instructions may implement a portion of an inferencing operation using a recurrent neural network.

Подробнее
01-11-2018 дата публикации

Method and system for automated improvement of parallelism in program compilation

Номер: AU2013290313B2
Принадлежит: Phillips Ormonde Fitzpatrick

A method of program compilation to improve parallelism during the linking of the program by a compiler. The method includes converting statements of the program to canonical form, constructing abstract system tree (AST) for each procedure in the program, and traversing the program to construct a graph by making each non-control flow statement and each control structure into at least one node of the graph.

Подробнее
09-11-2010 дата публикации

GENERAL PURPOSE SOFTWARE PARALLEL TASK ENGINE

Номер: CA0002638453C

A software engine for decomposing work to be done into tasks, and distributing the tasks to multiple, independent CPUs for execution is described. The engine utilizes dynamic code generation, with run-time specialization of variables, to achieve high performance. Problems are decomposed according to methods that enhance parallel CPU operation, and provide better opportunities for specialization and optimization of dynamically generated code. A specific application of this engine, a software three dimensional (3D) graphical image renderer, is described.

Подробнее
22-04-2015 дата публикации

CODE VERSIONING FOR ENABLING TRANSACTIONAL MEMORY REGION PROMOTION

Номер: CA0002830605A1
Принадлежит: BERESKIN & PARR LLP/S.E.N.C.R.L.,S.R.L.

An illustrative embodiment of a computer-implemented process for code versioning for enabling transactional memory region promotion receives a portion of candidate source code and outlines the portion of candidate source code received for parallel execution. The computer-implemented process further wraps a critical region with entry and exit routines to enter into a speculation sub-process, wherein the entry and exit routines also gather conflict statistics at runtime. The outlined code portion is executed to determine to use a particular one of multiple loop versions according to the conflict statistics gathered at run time.

Подробнее
17-08-2016 дата публикации

Hierarchical loop instruction

Номер: CN0103530088B
Автор:
Принадлежит:

Подробнее
19-07-2013 дата публикации

METHOD FOR OPTIMIZING PARALLEL PROCESSING OF DATA ON A HARDWARE PLATFORM

Номер: FR0002985824A1
Принадлежит: THALES

The invention relates to a method of optimizing parallel processing of data on a hardware platform comprising at least one computation unit comprising a plurality of processing units able to execute a plurality of executable tasks in parallel, in which the data set to be processed is decomposed into data subsets, the same sequence of operations being performed on each data subset. The method of the invention comprises obtaining (50, 52) the maximum number of data subsets to be processed by the same sequence of operations and a maximum number of tasks executable in parallel by a computation unit of the hardware platform, determining (54) at least two processing splits, each processing split corresponding to splitting the data set into a number of data groups and to assigning at least one executable task, able to execute said sequence of operations, to each data subset of said data group, and selecting (60, 62) the processing split that yields an optimal measurement value according to a predetermined criterion. Programming code instructions implementing said selected processing split are then obtained. One use of the method of the invention is the selection of an optimal hardware platform according to a measure of execution performance.

Подробнее
02-03-2006 дата публикации

Method and system for auto parallelization of zero-trip loops through induction variable substitution

Номер: US2006048119A1
Принадлежит:

A method and system of auto parallelization of zero-trip loops that substitutes a nested basic linear induction variable by exploiting a parallelizing compiler is provided. Provided is a use of a max{0,N} variable for loop iterations when no information is known about the value of N, for a typical loop iterating from 1 to N, in which N is the loop invariant. For the nested basic induction variables, an induction variable substitution process is applied to the nested loops starting from the innermost loop to the outermost one. Then a removal of the max operator afterwards through a copy propagation pass of the IBM compiler is provided. In doing so, the loop dependency on the induction variable is eliminated and an opportunity for a parallelizing compiler to parallelize the outermost loop is provided.
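
A minimal C sketch of the transformation, assuming a nested loop whose inner trip count n may be zero or negative: the nested basic induction variable is replaced by a closed form that uses max(0, n), after which the outer loop iterations no longer depend on one another. The code and names are illustrative, not the IBM compiler's actual output.

/* Before: a nested basic induction variable k couples the outer iterations,
 * which blocks parallelization of the outer loop. */
void before(double *a, int m, int n)
{
    int k = 0;
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {   /* zero-trip if n <= 0 */
            k = k + 1;                   /* basic induction variable */
            a[k - 1] = 0.0;
        }
}

static int imax(int x, int y) { return x > y ? x : y; }

/* After induction variable substitution: k is expressed in closed form using
 * max(0, n) trips per outer iteration, so the iterations over i are
 * independent and the outer loop can be run in parallel. */
void after(double *a, int m, int n)
{
    int trips = imax(0, n);              /* zero-trip loops contribute 0 */
    for (int i = 1; i <= m; i++)         /* now parallelizable over i */
        for (int j = 1; j <= n; j++)
            a[(i - 1) * trips + (j - 1)] = 0.0;
}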

Подробнее
10-05-2005 дата публикации

Method for software pipelining of irregular conditional control loops

Номер: US0006892380B2

A method for software pipelining of irregular conditional control loops including pre-processing the loops so they can be safely software pipelined. The pre-processing step ensures that each original instruction in the loop body can be over-executed as many times as necessary. During the pre-processing stage, each instruction in the loop body is processed in turn (N 4 ). If the instruction can be safely speculatively executed, it is left alone (N 6 ). If it could be safely speculatively executed except that it modifies registers that are live out of the loop, then the instruction can be pre-processed using predication or register copying (N 7 , N 8 , N 9 ). Otherwise, predication must be applied (N 10 ). Predication is the process of guarding an instruction. When the guard condition is true, the instruction executes as though it were unguarded. When the guard condition is false, the instruction is nullified.
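
At the source level, guarding can be imitated with an explicit predicate, as in the C sketch below: every statement executes on every iteration, but the predicate nullifies its effect once the loop's logical exit condition has been reached, so over-execution by a pipelined schedule is harmless. The example loop and names are invented for illustration.

#include <stddef.h>

/* Source-level picture of predication: once "done" becomes true, later
 * iterations still run (they may be over-executed by the pipelined schedule),
 * but every guarded statement is nullified, so the extra executions are safe. */
size_t sum_until_negative(const int *a, size_t n, long *out_sum)
{
    long sum = 0;
    size_t count = 0;
    int done = 0;                      /* loop's logical exit state */

    for (size_t i = 0; i < n; i++) {
        int p = !done && a[i] >= 0;    /* guard condition for this iteration */
        sum   = p ? sum + a[i] : sum;  /* guarded: takes effect only when p */
        count = p ? count + 1  : count;
        done  = done || a[i] < 0;      /* live-out state updated monotonically */
    }
    *out_sum = sum;
    return count;
}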

Подробнее
15-02-2000 дата публикации

Method and apparatus for optimizing program loops containing omega-invariant statements

Номер: US6026240A
Автор:
Принадлежит:

Apparatus, methods, and computer program products are disclosed for optimizing programs containing single basic block natural loops with a determinable number of iterations. The invention optimizes, for execution speed, such program loops containing statements that are initially variant, but stabilize and become invariant after some number of iterations of the loop. The invention optimizes the loop by unwinding iterations from the loop for which the statements are variant, and by hoisting the stabilized statement from subsequent iterations of the loop.
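
A small C example of the optimization described above, under the assumption that the statement stabilizes after two iterations: the first two (variant) iterations are unwound, and the stabilized value is hoisted out of the remaining loop. The function and array names are illustrative only.

/* Before: "limit = a[i < 2 ? i : 2]" is variant for the first two iterations
 * and invariant afterwards (it always reads a[2] once i >= 2). */
long before(const long *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        long limit = a[i < 2 ? i : 2];   /* stabilizes at i == 2 */
        s += limit;
    }
    return s;
}

/* After: unwind the two iterations where the statement still varies, then
 * hoist the stabilized value out of the remaining loop. */
long after(const long *a, int n)
{
    long s = 0;
    for (int i = 0; i < n && i < 2; i++)   /* unwound, variant iterations */
        s += a[i];
    if (n > 2) {
        long limit = a[2];                 /* hoisted, now loop-invariant */
        for (int i = 2; i < n; i++)
            s += limit;
    }
    return s;
}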

Подробнее
25-07-1995 дата публикации

Method of generating from source program object program by which final values of variables for parallel execution are guaranteed

Номер: US0005437034A
Автор:
Принадлежит:

In a method of generating an object program for a multiprocessor system from a source program including a loop, a variable in the loop is detected. For the detected variable, first codes providing a one-dimensional work array are added to the source program. The work array has elements whose number is predetermined according to a maximum number of parallel processes to be generated for the loop and is shared among the parallel processes. It is determined whether or not the variable is used outside the loop. When it is determined that the variable is not used at any position outside the loop, the source program with the first codes added is compiled to produce the object program, thereby executing the loop in a parallel fashion by the parallel processes using the elements of the work array as a local variable associated with the variable.
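
The substitution can be sketched with OpenMP standing in for the parallel processes (an assumption made for the example; the patent does not prescribe OpenMP). The scalar temporary of the loop becomes one element of a shared work array per process, indexed by the process number, and because the temporary is not used after the loop no final value needs to be recovered.

#include <omp.h>

#define MAX_PROCS 64   /* assumed upper bound on parallel processes */

/* The scalar temporary of the original loop is replaced by one element of a
 * shared work array per parallel process.  Assumes the runtime uses at most
 * MAX_PROCS threads. */
void scale_rows(double *a, const double *row_max, int n, int m)
{
    double work[MAX_PROCS];                 /* shared one-dimensional work array */

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int p = omp_get_thread_num();       /* this process's element */
        work[p] = 1.0 / row_max[i];         /* was: t = 1.0 / row_max[i]; */
        for (int j = 0; j < m; j++)
            a[i * m + j] *= work[p];        /* was: a[i*m+j] *= t;        */
    }
    /* work[] is not used after the loop, so no final value must be restored. */
}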

Подробнее
17-02-2005 дата публикации

Processors and compiling methods for processors

Номер: US2005039167A1
Автор:
Принадлежит:

A compiling method compiles an object program to be executed by a processor having a plurality of execution units operable in parallel. In the method a first availability chain is created from a producer instruction (p1), scheduled for execution by a first one of the execution units (20: AGU), to a first consumer instruction (c1), scheduled for execution by a second one of the execution units (22: EXU) and requiring a value produced by the said producer instruction. The first availability chain comprises at least one move instruction (mv1-mv3) for moving the required value from a first point (20: ARF) accessible by the first execution unit to a second point (22: DRF) accessible by the second execution unit. When a second consumer instruction (c2), also requiring the same value, is scheduled for execution by an execution unit (23: EXU) other than the first execution unit, at least part of the first availability chain is reused to move the required value to a point (23: DRF) accessible by ...

Подробнее
29-10-2020 дата публикации

COMPILATION TO REDUCE NUMBER OF INSTRUCTIONS FOR DEEP LEARNING PROCESSOR

Номер: US20200341765A1
Принадлежит:

A method performed during execution of a compilation process for a program having nested loops is provided. The method replaces multiple conditional branch instructions for a processor which uses a conditional branch instruction limited to only comparing a value of a general register with a value of a special register that holds a loop counter value. The method generates, in replacement of the multiple conditional branch instructions, the conditional branch instruction limited to only comparing the value of the general register with the value of the special register that holds the loop counter value for the inner-most loop. The method adds (i) a register initialization outside the nested loops and (ii) a register value adjustment to the inner-most loop. The method defines the value for the general register for the register initialization and conditions for the generated conditional branch instruction, responsive to requirements of the multiple conditional branch instructions.
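
A rough source-level analogue of the replacement, with invented names and without modelling the special loop-counter register itself: the separate bound checks of a nested loop are collapsed into a single comparison against one counter that is initialised outside the nest and adjusted inside the innermost loop.

/* Before: two loop-back branches, one per nesting level. */
void before(float *a, int rows, int cols)
{
    for (int i = 0; i < rows; i++)        /* branch on i */
        for (int j = 0; j < cols; j++)    /* branch on j */
            a[i * cols + j] += 1.0f;
}

/* After: one counter, initialised outside the nest and adjusted in the
 * innermost loop, drives the single remaining loop-back comparison. */
void after(float *a, int rows, int cols)
{
    int remaining = rows * cols;          /* register initialisation outside the nest */
    int i = 0, j = 0;
    while (remaining > 0) {               /* the single loop-counter comparison */
        a[i * cols + j] += 1.0f;
        remaining--;                      /* register value adjustment in the inner loop */
        if (++j == cols) { j = 0; i++; }  /* index bookkeeping, no extra loop branch */
    }
}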

Подробнее
05-11-1999 дата публикации

VECTOR REGISTER CONTROL PROCESSOR AND RECORDING MEDIUM

Номер: JP0011306167A
Принадлежит:

PROBLEM TO BE SOLVED: To improve both translation performance and execution performance by vectorizing a multiplex loop of a source program and taking out the invariable vector data to the outside of the loop even when the loop is executed in a vectorized main body part. SOLUTION: An optimization means 4 consists of a vectorization means 5 which optimizes the vectorization based on the analysis result of a source program analysis means 3 and an optimization execution means 6. Then a vector register allocation means 7 of the means 6 vectorizes a multiplex loop of a source program 1 and allocates a vector register. A vector register control processing means 8 takes out the invariable vector data to the front or back part of a vectorized main body even when a loop is executed in this main body part. As a result, both translation performance and execution performance can be improved. COPYRIGHT: (C)1999,JPO ...

Подробнее
22-08-2000 дата публикации

A SYSTEM AND METHOD FOR OPTIMIZING PROGRAM EXECUTION IN A COMPUTER SYSTEM

Номер: CA0002262277A1
Принадлежит:

A method, computer system and article of manufacture for optimizing a computer program, the method comprising the steps of executing an application program and profiling selected loops of the executing program. Characteristics of the profiled loops are then compared to corresponding predetermined threshold values and the results of the comparison are used to select an optimization to be applied to subsequent execution of the selected loops. In a preferred embodiment, the optimization is the selection of either a parallel version or a serial version of the loop. Further embodiments provide for the selection of the number of processors for parallel implemented loops and for the selection of an unroll factor in serially implemented loops.

Подробнее
11-12-2008 дата публикации

PARALLELIZING SEQUENTIAL FRAMEWORKS USING TRANSACTIONS

Номер: WO000002008151045A1
Принадлежит: MICROSOFT CORPORATION

Various technologies and techniques are disclosed for transforming a sequential loop into a parallel loop for use with a transactional memory system. A transactional memory system is provided. A first section of code containing an original sequential loop is transformed into a second section of code containing a parallel loop that uses transactions to preserve an original input to output mapping. For example, the original sequential loop can be transformed into a parallel loop by taking each iteration of the original sequential loop and generating a separate transaction that follows a pre-determined commit order process. At least some of the separate transactions are executed in different threads. When an unhandled exception is detected that occurs in a particular transaction while the parallel loop is executing, state modifications made by the particular transaction and predecessor transactions are committed, and state modifications made by successor transactions are discarded.
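
The commit-order idea can be approximated in C with OpenMP's ordered construct standing in for the transactional-memory machinery (an assumption for the sketch; it does not detect conflicts or squash and re-execute transactions as the described system would). Each iteration does its work privately and then publishes its state modifications in the original iteration order.

#include <omp.h>
#include <stddef.h>

/* Each iteration does its work privately ("the transaction body") and then
 * commits its state modifications in original iteration order. */
void prefix_squares(double *out, const double *in, size_t n)
{
    double running = 0.0;

    #pragma omp parallel for ordered schedule(static, 1)
    for (size_t i = 0; i < n; i++) {
        double local = in[i] * in[i];   /* speculative, thread-private work */

        #pragma omp ordered             /* commit in the original loop order */
        {
            running += local;           /* state modification made visible */
            out[i] = running;
        }
    }
}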

Подробнее
20-02-2014 дата публикации

PARALLEL MEMORY SYSTEMS

Номер: US20140052961A1
Принадлежит:

The invention relates to a multi-core processor memory system, wherein it is provided that the system comprises memory channels between the multi-core processor and the system memory, and that the system comprises at least as many memory channels as processor cores, each memory channel being dedicated to a processor core, and that the memory system at run-time dynamically associates memory blocks with the accessing core, the accessing core having dedicated access to the memory bank via the memory channel.

Подробнее
06-06-2024 дата публикации

VECTORIZING A LOOP

Номер: US20240184554A1
Автор: Gil Rapaport, Ayal Zaks
Принадлежит:

A method includes: receiving input code that comprises a loop that operates on a first array of elements and a second array of elements, wherein during an iteration of the loop a first operation is performed on an element of the first array of elements, or a second operation is performed on an element of the second array of elements; generating a first compound operation that operates on a predetermined number of elements of the first array of elements, the first compound operation resulting in a first intermediate vector; generating a second compound operation that operates on the predetermined number of elements of the second array of elements, the second compound operation resulting in a second intermediate vector; interleaving the first intermediate vector and the second intermediate vector and storing the interleaved result in a temporary vector; and summing the interleaved result in the temporary vector using an order-preserving sum.

Подробнее
25-04-1995 дата публикации

OPTIMIZED PARALLEL COMPILING DEVICE AND OPTIMIZED PARALLEL COMPILING METHOD

Номер: JP0007110800A
Автор: ZAIKI KOUJI
Принадлежит:

PURPOSE: To provide optimized compiling device and method for minimizing a data transfer number at the time of parallelizing program loops and a program converter using them. CONSTITUTION: A source program inputted from an input means 201 is converted to an intermediate code in an intermediate code generation means 202 and a loop and a variable referred to in the loop are detected from the intermediate code by a loop detection means 203 and a reference variable detection means 204. Further, the data transfer number to be required by the parallelization of the loops is calculated for the respective parallelization object loops by a data transfer number detection means 207. A parallelization judgement means 209 decides the parallelization loop whose data transfer number is the minimum and parallelizes the loop. COPYRIGHT: (C)1995,JPO ...

Подробнее
23-10-2019 дата публикации

Program loop control

Номер: GB0002548602B
Принадлежит: ADVANCED RISC MACH LTD, ARM Limited

Подробнее
05-08-2013 дата публикации

SYSTEMS, METHODS, AND APPARATUSES TO DECOMPOSE A SEQUENTIAL PROGRAM INTO MULTIPLE THREADS, EXECUTE SAID THREADS, AND RECONSTRUCT THE SEQUENTIAL EXECUTION

Номер: KR0101292439B1
Принадлежит: Intel Corporation

Systems, methods, and apparatus are described that decompose sequential programs into multiple threads, execute such threads, and reconstruct sequential execution of the threads. A plurality of data cache units (DCUs) store locally retired instructions of speculatively executed threads. A merge level cache (MLC) merges data from the lines of the DCUs. An inter-core memory coherency module (ICMC) globally retires the instructions of the speculatively executed threads in the MLC.

Подробнее
17-05-2016 дата публикации

Method for parallelizing a program loop with a loop-carried dependency

Номер: BR102014023779A2
Принадлежит:

Подробнее
15-12-2016 дата публикации

GENERATING OBJECT CODE FROM INTERMEDIATE CODE THAT INCLUDES HIERARCHICAL SUB-ROUTINE INFORMATION

Номер: US20160364216A1
Автор: Lee Howes, HOWES LEE, Howes Lee
Принадлежит:

Examples are described for a device to receive intermediate code that was generated from compiling source code of an application. The intermediate code includes information generated from the compiling that identifies a hierarchical structure of lower level sub-routines in higher level sub-routines, and the lower level sub-routines are defined in the source code of the application to execute more frequently than the higher level sub-routines that identify the lower level sub-routines. The device is configured to compile the intermediate code to generate object code based on the information that identifies lower level sub-routines in higher level sub-routines, and store the object code.

Подробнее
19-02-2013 дата публикации

Insertion of multithreaded execution synchronization points in a software program

Номер: US0008381203B1

A compiler is configured to determine a set of points in a flow graph for a software program where multithreaded execution synchronization points are inserted to synchronize divergent threads for SIMD processing. MIMD execution of divergent threads is allowed and execution of the divergent threads proceeds until a synchronization point is reached. When all of the threads reach the synchronization point, synchronous execution resumes. The synchronization points are needed to ensure proper execution of the certain instructions that require synchronous execution as defined in some graphics APIs and when synchronous execution improves performance based on a SIMD architecture.

Подробнее
26-05-2015 дата публикации

Optimization of loops and data flow sections in multi-core processor environment

Номер: US0009043769B2
Автор: Martin Vorbach
Принадлежит: Hyperion Core Inc.

The present invention relates to a method for compiling code for a multi-core processor, comprising: detecting and optimizing a loop, partitioning the loop into partitions executable and mappable on physical hardware with optimal instruction level parallelism, optimizing the loop iterations and/or loop counter for ideal mapping on hardware, chaining the loop partitions generating a list representing the execution sequence of the partitions.

Подробнее
26-09-2019 дата публикации

OPTIMIZE CONTROL-FLOW CONVERGENCE ON SIMD ENGINE USING DIVERGENCE DEPTH

Номер: US20190294444A1
Принадлежит:

There are provided a system, a method and a computer program product for selecting an active data stream (a lane) while running Single Program Multiple Data code on a Single Instruction Multiple Data machine. The machine runs an instruction stream over input data streams and machine increments lane depth counters of all active lanes upon the thread-PC reaching a branch operation and updates the lane-PC of each active lane according to targets of the branch operation. An instruction of the instruction stream includes a barrier indicating a convergence point for all lanes to join. In response to a lane reaching a barrier: evaluating whether all lane-PCs are set to a same thread-PC; and if the lane-PCs are not set to the same thread-PC, selecting an active lane from the plurality of lanes; otherwise, incrementing the lane-PCs of all the lanes, and then selecting an active lane from the plurality of lanes.

Подробнее
27-05-2014 дата публикации

Parallelizing non-countable loops with hardware transactional memory

Номер: US0008739141B2

A system and method for speculatively parallelizing non-countable loops in a multi-threaded application. A multi-core processor receives instructions for a multi-threaded application. The application may contain non-countable loops. Non-countable loops have an iteration count value that cannot be determined prior to the execution of the non-countable loop, a loop index value that cannot be non-speculatively determined prior to the execution of an iteration of the non-countable loop, and control that is not transferred out of the loop body by a code line in the loop body. The compiler replaces the non-countable loop with a parallelized loop pattern that uses outlined function calls defined in a parallelization library (PL) in order to speculatively execute iterations of the parallelized loop. The parallelized loop pattern is configured to squash and re-execute any speculative thread of the parallelized loop pattern that is signaled to have a transaction failure.

Подробнее
14-06-2023 дата публикации

METHOD AND SYSTEM FOR AUTOMATED IMPROVEMENT OF PARALLELISM IN PROGRAM COMPILATION

Номер: EP2872989B1
Автор: Craymer, Loring
Принадлежит: Craymer, Loring

Подробнее
09-08-2000 дата публикации

Predicated execution of instructions in processors

Номер: GB0000014432D0
Автор:
Принадлежит:

Подробнее
28-09-2011 дата публикации

Modulus-scheduling-based compiling method and device for realizing circular instruction scheduling

Номер: CN0102200924A
Принадлежит:

The invention discloses a modulus-scheduling-based compiling method and device for realizing circular instruction scheduling. The method comprises the following steps which are executed by a compiler: reading and analyzing a source program to acquire control flow graph information; establishing data dependence restriction and resource dependence restriction of a loop body structure; and solving according to a back model in accordance with corresponding restriction regarding the data dependence conflict and/or resource conflict happening in detecting the instruction scheduling result in a process that the loop body structure executes the modulus scheduling. By adopting the method, data correlation of adjacent instructions in the loop body can be avoided, and the execution time of generating codes is reduced, so that the instruction-level parallelism can be effectively excavated, and the performance of a processor system even a computer system can be improved.

Подробнее
03-11-2015 дата публикации

Method for protecting a program

Номер: KR1020150123282A
Принадлежит:

... A method of protecting a first program, the first program comprising a finite number of program points and progression rules associated with the program points and defining a path from one program point to another program point, the method comprising: the definition of a plurality of exit cases and, where a second program is used in the definition of the first program, the definition, for each exit case of the second program, of a branch toward a specific program point of the first program or a declaration that branching is impossible; the definition of a set of properties to be proven, each associated with one or more constituent elements of the first program, the set of properties including the impossibility of branching as a particular property; and the establishment of a formal proof of the set of properties.

Подробнее
20-07-2017 дата публикации

PROGRAM OPTIMIZATION BASED ON DIRECTIVES FOR INTERMEDIATE CODE

Номер: US20170206068A1
Принадлежит:

An optimization system to apply directives to a computer program without having to perform repeated front-end compilations of source code of the computer program is provided. In some embodiments, the optimization system performs a first compilation of the source code of the program to generate first front-end code and first back-end code of the computer program. The compilation includes a first front-end compilation and a first back-end compilation. The optimization system identifies a compiler directive to apply to a location within the first front-end code. The optimization system then performs a second back-end compilation of the first front-end code factoring in the compiler directive to generate second back-end code affected by the compiler directive.

Подробнее
25-06-2019 дата публикации

Optimization of loops and data flow sections in multi-core processor environment

Номер: US0010331615B2
Принадлежит: Hyperion Core, Inc., HYPERION CORE INC

The present invention relates to a method for compiling code for a multi-core processor, comprising: detecting and optimizing a loop, partitioning the loop into partitions executable and mappable on physical hardware with optimal instruction level parallelism, optimizing the loop iterations and/or loop counter for ideal mapping on hardware, chaining the loop partitions generating a list representing the execution sequence of the partitions.

Подробнее
12-08-2014 дата публикации

Program generation device, program production method, and program

Номер: US0008806466B2
Принадлежит: Panasonic Corporation

A program generation apparatus references a source program including a loop for executing a block N times (N≧2) and having such dependence that a variable defined in a statement in the block pertaining to the ith execution (1≦i ...

Подробнее

30-04-2020 дата публикации

AUTOMATIC GENERATION OF MULTI-SOURCE BREADTH-FIRST SEARCH FROM HIGH-LEVEL GRAPH LANGUAGE FOR DISTRIBUTED GRAPH PROCESSING SYSTEMS

Номер: US20200133663A1
Принадлежит:

Techniques are described herein for automatic generation of multi-source breadth-first search (MS-BFS) from high-level graph processing language that can be executed in a distributed computing environment. In an embodiment, a method involves a computer analyzing original software instructions. The original software instructions are configured to perform multiple breadth-first searches to determine a particular result. Each breadth-first search originates at each of a subset of vertices of a graph. Each breadth-first search is encoded for independent execution. Based on the analyzing, the computer generates transformed software instructions configured to perform a MS-BFS to determine the particular result. Each of the subset of vertices is a source of the MS-BFS. In an embodiment, the second plurality of software instructions comprises a node iteration loop and a neighbor iteration loop, and the plurality of vertices of the distributed graph comprise active vertices and neighbor vertices. The node iteration loop is configured to iterate once per each active vertex of the plurality of vertices of the distributed graph, and the node iteration loop is configured to determine the particular result. The neighbor iteration loop is configured to iterate once per each active vertex of the plurality of vertices of the distributed graph, and each iteration of the neighbor iteration loop is configured to activate one or more neighbor vertices of the plurality of vertices for the following iteration of the neighbor iteration loop. 1. A method comprising:analyzing a first plurality of software instructions, wherein the first plurality of software instructions is configured to perform a plurality of breadth-first searches to determine a particular result, wherein each breadth-first search originates at each of a plurality of vertices of a distributed graph, wherein each breadth-first search is encoded for independent execution;based on said analyzing, generating a second plurality ...

Подробнее
09-01-2001 дата публикации

Method of compiling a loop

Номер: US0006173443B1
Автор: Akiyoshi Wakatani
Принадлежит: Matsushita Electric Industrial Co Ltd

In a method of compiling, the contents of registers corresponding to data arrays having the same array names but having different indexes in sequence with the progress of a loop prior to loop return are moved, and only that having the smallest index among those which should be stored is stored. In this manner, the number of Load/Stores is reduced. Moreover, by unrolling loops, register moves may be omitted. Thus, by the application of the method of register allocation and changing the method of register allocation, execution of loops containing calculations of data arrays is speeded up by the extent of unnecessary memory accesses which have been eliminated.
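
The register-move and unrolling ideas can be shown on a simple three-point stencil in C. The rotated version loads only the element with the smallest new index each iteration and moves the others between scalars before the loop returns; unrolling by the rotation period renames the scalars instead, so the moves disappear. The stencil and all names are invented for the illustration.

/* Before: each iteration loads a[i], a[i+1] and a[i+2] from memory. */
void smooth_naive(double *b, const double *a, int n)
{
    for (int i = 0; i + 2 < n; i++)
        b[i] = (a[i] + a[i + 1] + a[i + 2]) / 3.0;
}

/* After: keep the three elements in scalars and "move" them across the back
 * edge, so only one new element is loaded per iteration. */
void smooth_rotated(double *b, const double *a, int n)
{
    if (n < 3) return;
    double r0 = a[0], r1 = a[1], r2;
    for (int i = 0; i + 2 < n; i++) {
        r2 = a[i + 2];                 /* only the newest element is loaded */
        b[i] = (r0 + r1 + r2) / 3.0;
        r0 = r1;                       /* register moves before loop return */
        r1 = r2;
    }
}

/* Unrolling by the rotation period (3) renames the scalars instead of moving
 * them, so the copies above disappear from the unrolled body. */
void smooth_unrolled(double *b, const double *a, int n)
{
    if (n < 3) return;
    double r0 = a[0], r1 = a[1], r2;
    int i = 0;
    for (; i + 4 < n; i += 3) {
        r2 = a[i + 2]; b[i]     = (r0 + r1 + r2) / 3.0;
        r0 = a[i + 3]; b[i + 1] = (r1 + r2 + r0) / 3.0;
        r1 = a[i + 4]; b[i + 2] = (r2 + r0 + r1) / 3.0;
    }
    for (; i + 2 < n; i++) {           /* remainder iterations */
        r2 = a[i + 2];
        b[i] = (r0 + r1 + r2) / 3.0;
        r0 = r1; r1 = r2;
    }
}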

Подробнее
19-12-2001 дата публикации

METHOD OF UPDATING PROGRAM AND COMMUNICATION TERMINAL

Номер: EP0001164471A2
Автор: Topham, Nigel Peter
Принадлежит:

A processor, operable to execute instructions on a predicated basis, includes a series of predicate registers (135), a control information holding unit (131) and a plurality of operating units (133). Each predicate register of the series (135) is switchable between at least respective first and second states and each is assignable to one or more predicated-execution instructions. The control information holding unit (131) holds items of control information which correspond respectively to the predicate registers, and each operating unit also corresponds individually to one of the predicate registers. Each operating unit has a first control input connected to the control information holding unit (131) for receiving the control-information item corresponding to its unit's own corresponding predicate register and also has a second control input connected for receiving the control-information item corresponding to a further one of the predicate registers. Each operating unit is operable to ...

Подробнее
14-10-2004 дата публикации

DETECTION METHOD AND SYSTEM OF REDUCTION VARIABLE IN ASSIGNMENT STATEMENT, AND PROGRAM PRODUCT

Номер: JP2004288163A
Автор: BERA RAJENDRA K
Принадлежит:

PROBLEM TO BE SOLVED: To provide a method, system and program product to detect reduction variables in assignment statements in source codes for enabling parallel execution of program loops. SOLUTION: The reduction variables are tagged to the respective loops and passed to a compiler through compiler directives for parallelizing a reduction operation along with information about each variable's respective associative operator. COPYRIGHT: (C)2005,JPO&NCIPI ...

Подробнее
29-06-1993 дата публикации

ECHELON METHOD FOR EXECUTION OF NESTED LOOPS IN MULTIPLE PROCESSOR COMPUTERS

Номер: CA0001319757C

A compiler for generating code for enabling multiple processors to process programs in parallel. The code enables the multiple processor system to operate in the following manner: one iteration of an outer loop in a set of nested loops is assigned to each processor. If the outer loop contains more iterations than processors in the system, the processors are initially assigned an earlier iteration, and the remaining iterations are assigned to the processors one by one as they finish their earlier iterations. Each processor runs the inner loop iterations serially. In order to enforce dependencies in the loops, each processor reports its progress in its iterations of the inner loop to the processor executing the succeeding outer loop iteration and then waits until the processor computing the preceding outer loop iteration is ahead or behind in processing its inner loop iteration by an amount which guarantees that dependencies will be enforced.

Подробнее
29-04-2015 дата публикации

Code versioning for enabling transactional memory promotion

Номер: CN104572260A
Принадлежит:

Подробнее
06-10-2015 дата публикации

Hardware and software solutions for branches in a parallel pipeline

Номер: KR1020150112017A
Автор: Yazdani Reza
Принадлежит:

... A system and method for efficient processing of instructions in hardware parallel execution lanes within a processor are disclosed. In response to a given branch point within an identified loop, the compiler arranges the instructions within the identified loop into very large instruction words (VLIWs). At least one VLIW contains instructions intermingled from different basic blocks between the given branch point and a corresponding convergence point. The compiler generates code which, when executed, assigns the instructions within a given VLIW at runtime to a plurality of parallel execution lanes within a target processor. The target processor includes a SIMD (single instruction multiple data) micro-architecture. The assignment for a given lane is based on the branch direction found at runtime for that lane at the given branch point. The target processor includes a vector register for storing an indication of the given instruction, within a fetched VLIW, that the associated lane is to execute.

Подробнее
12-11-2019 дата публикации

Method for securing a first program, and computer-readable medium

Номер: BR112015020394A8
Автор: DOMINIQUE BOLIGNANO
Принадлежит:

Подробнее
28-06-2012 дата публикации

LOOP PARALLELIZATION BASED ON LOOP SPLITTING OR INDEX ARRAY

Номер: WO2012087988A2
Принадлежит:

Methods and apparatus to provide loop parallelization based on loop splitting and/or index array are described. In one embodiment, one or more split loops, corresponding to an original loop, are generated based on the mis-speculation information. In another embodiment, a plurality of subloops are generated from an original loop based on an index array. Other embodiments are also described.
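A minimal C sketch, under assumed data, of index-array-based loop splitting: the unconditional work runs in a first subloop, which also records the iterations needing a rare path into an index array; a second subloop then processes only those recorded iterations, so the bulk of the work carries no data-dependent control flow.

#include <stdio.h>

#define N 16

int main(void) {
    int a[N], b[N], idx[N], m = 0;
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 0; }

    /* subloop 1: unconditional work plus collection of special cases */
    for (int i = 0; i < N; i++) {
        b[i] = a[i] * 2;            /* easily parallelized / vectorized part */
        if (a[i] % 5 == 0)          /* rare condition */
            idx[m++] = i;           /* remember iteration for later */
    }

    /* subloop 2: only the iterations recorded in the index array */
    for (int k = 0; k < m; k++)
        b[idx[k]] += 1;

    for (int i = 0; i < N; i++) printf("%d ", b[i]);
    printf("\n");
    return 0;
}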

Подробнее
16-01-2014 дата публикации

METHOD AND SYSTEM FOR AUTOMATED IMPROVEMENT OF PARALLELISM IN PROGRAM COMPILATION

Номер: WO2014011696A1
Автор: CRAYMER, Loring
Принадлежит:

A method of program compilation to improve parallelism during the linking of the program by a compiler. The method includes converting statements of the program to canonical form, constructing an abstract syntax tree (AST) for each procedure in the program, and traversing the program to construct a graph by making each non-control flow statement and each control structure into at least one node of the graph.

Подробнее
07-05-1998 дата публикации

DATA DISTRIBUTION AND ARRANGEMENT DETERMINATION METHOD FOR PARALLEL COMPUTERS AND APPARATUS FOR THE METHOD

Номер: WO1998019249A1
Автор: OTA, Hiroshi
Принадлежит:

The sorting relation between the array dimensions and the loops is determined first, the loop most appropriate as a distribution candidate is selected, and the distribution of the array is determined in accordance with the selected loop. Consequently, the time taken to determine the sorting relation is shortened. The possibility that the optimum sorting relation is finally employed is increased by retaining a plurality of sorting-relation candidates when determining the sorting relation between the array dimensions and the loops.

Подробнее
15-02-1994 дата публикации

Multitasking system for in-procedure loops

Номер: US0005287509A
Автор:
Принадлежит:

A system for multitasking inner loops, such as DO loops, using multiprocessors, provided with a plurality of shared registers each corresponding to one of a plurality of individual processors comprising the multiprocessor system. The plurality of shared registers store start and end values of segments resulting from dividing ranges of loop variables corresponding to the inner loops. The system for multitasking inner loops comprises an executing unit for iteratively executing the processing of the inner loops until the end value is reached. The system also comprises a decision unit for deciding whether or not there remain any unprocessed loops. Finally, the system comprises a continuing unit, responsive to the decision unit for continuing processing of the unprocessed loop or loops by transferring a part of a range which the loop variables corresponding to the unprocessed loop or loops can have.
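A plain C sketch of the range segmentation described above: the loop-variable range is divided into one segment per processor, and the start and end values that the shared registers would hold are computed explicitly. The processor count and trip count are assumptions for illustration.

#include <stdio.h>

#define N_PROC 4
#define N      103   /* total iterations of the inner (DO) loop */

int main(void) {
    int start[N_PROC], end[N_PROC];
    int chunk = (N + N_PROC - 1) / N_PROC;

    /* each "shared register" pair would hold one segment's start and end */
    for (int p = 0; p < N_PROC; p++) {
        start[p] = p * chunk;
        end[p]   = (p + 1) * chunk < N ? (p + 1) * chunk : N;
    }

    long sum = 0;
    for (int p = 0; p < N_PROC; p++)          /* each p would run on its own processor */
        for (int i = start[p]; i < end[p]; i++)
            sum += i;                          /* body of the inner loop */

    printf("%ld\n", sum);                      /* same result as the serial loop */
    return 0;
}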

Подробнее
06-10-1998 дата публикации

Method and apparatus for scheduling instructions for execution on a multi-issue architecture computer

Номер: US0005819088A
Автор: James R. Reinders
Принадлежит: Intel Corp

Improved parallelism in the generated schedules of basic blocks of a program being compiled is advantageously achieved by providing an improved scheduler to the code generator of a compiler targeting a multi-issue architecture computer. The improved scheduler implements the prior-art list scheduling technique with a number of improvements including differentiation of instructions into squeezed and non-squeezed instructions, employing priority functions that factor in the squeezed and non-squeezed instruction distinction for selecting a candidate instruction, tracking only the resources utilized by the non-squeezed instructions, and tracking the scheduling of the squeezed and non-squeezed instructions separately. When software pipelining is additionally employed to further increase parallelism in program loops, the improved scheduler factors only the non-squeezed instructions in the initial minimum schedule (initiation interval) size calculation.

Подробнее
29-10-2013 дата публикации

Methods and apparatus for joint parallelism and locality optimization in source code compilation

Номер: US0008572590B2

Methods, apparatus and computer software product for source code optimization are provided. In an exemplary embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution unit. The second computing apparatus contains at least two multi-stage execution units that allow for parallel execution of tasks. The first custom computing apparatus optimizes the code for both parallelism and locality of operations on the second computing apparatus. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims.
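A small C sketch of the kind of transformation such an optimizer applies, assuming a tile size chosen for the cache: the outer tile loops expose coarse-grain parallelism while the inner loops keep accesses within a cache-sized block for locality. The OpenMP pragma and tile size are illustrative, not taken from the patent.

#include <stdio.h>

#define N    64
#define TILE 16   /* tile size chosen for cache locality (assumed) */

int main(void) {
    static double a[N][N], b[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + j;

    #pragma omp parallel for collapse(2)       /* tiles run in parallel */
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)        /* locality inside a tile */
                for (int j = jj; j < jj + TILE; j++)
                    b[j][i] = a[i][j];                   /* blocked transpose */

    printf("%f\n", b[N - 1][0]);
    return 0;
}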

Подробнее
24-03-2004 дата публикации

Processors and compiling methods for processors

Номер: GB0000403623D0
Автор: [UNK]
Принадлежит: PTS Corp

Подробнее
05-12-2001 дата публикации

Processors and compiling methods for processors

Номер: GB0000124553D0
Автор:
Принадлежит:

Подробнее
20-09-2007 дата публикации

GENERAL PURPOSE SOFTWARE PARALLEL TASK ENGINE

Номер: CA0002638453A1
Принадлежит:

Подробнее
20-05-2003 дата публикации

A SYSTEM AND METHOD FOR OPTIMIZING PROGRAM EXECUTION IN A COMPUTER SYSTEM

Номер: CA0002262277C

A method, computer system and article of manufacture for optimizing a computer program, the method comprising the steps of executing an application program and profiling selected loops of the executing program. Characteristics of the profiled loops are then compared to corresponding predetermined threshold values and the results of the comparison are used to select an optimization to be applied to subsequent execution of the selected loops. In a preferred embodiment, the optimization is the selection of either a parallel version or a serial version of the loop. Further embodiments provide for the selection of the number of processors for parallel implemented loops and for the selection of an unroll factor in serially implemented loops.
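A hedged C sketch of the selection step, with an assumed threshold standing in for the profiled characteristics: the same loop exists in a parallel and a serial version, and one of them is chosen at run time from the trip count.

#include <stdio.h>

#define PAR_THRESHOLD 10000   /* assumed threshold derived from profiling */

/* Once profiling has measured the loop's cost, later executions pick
 * either the parallel or the serial version of the same loop. */
static double work(double *a, long n) {
    double s = 0.0;
    if (n >= PAR_THRESHOLD) {
        #pragma omp parallel for reduction(+:s)          /* parallel version */
        for (long i = 0; i < n; i++) s += a[i] * a[i];
    } else {
        for (long i = 0; i < n; i++) s += a[i] * a[i];   /* serial version */
    }
    return s;
}

int main(void) {
    static double a[20000];
    for (long i = 0; i < 20000; i++) a[i] = 1.0;
    printf("%f %f\n", work(a, 100), work(a, 20000));
    return 0;
}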

Подробнее
28-05-2003 дата публикации

METHOD OF PARALLEL LOOP TRANSFORMATION FOR ON-THE-FLY RACE DETECTION IN PARALLEL PROGRAM

Номер: KR20030042319A
Принадлежит:

PURPOSE: A method of parallel loop transformation for on-the-fly race detection in a parallel program is provided to minimize the objects that must be monitored for on-the-fly race detection while the parallel program is executing. CONSTITUTION: The information needed to modify the parallel loop is extracted through a static analysis of the loop body, using each parallel loop as input(S210). If the information needed to modify the parallel loop is extracted, the modified parallel loop is generated by using the extracted information and the parallel loop as the input(S220). A race detection function is set up for each statement in the parallel loop that can generate a race(S230). The race is detected by executing the modified parallel loop program with the race detection function set up(S240). © KIPO 2003 ...
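A rough C sketch of the instrumentation step, with a purely illustrative hook: the modified parallel loop calls a detection function around the accesses identified by the static analysis; a real on-the-fly detector would record and compare access histories rather than just log them.

#include <stdio.h>

#define N 8

/* Illustrative detection hook; a real detector would record access
 * histories per shared location and check them for conflicts. */
static void check_access(int iter, int index, const char *kind) {
    printf("iteration %d: %s x[%d]\n", iter, kind, index);
}

int main(void) {
    int x[N] = {0};
    for (int i = 0; i < N; i++) {      /* parallel loop in the original program */
        check_access(i, i, "write");
        x[i] = i;
        if (i > 0) {
            check_access(i, i - 1, "read");   /* potential race across iterations */
            x[i] += x[i - 1];
        }
    }
    return 0;
}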

Подробнее
20-09-2012 дата публикации

PARALLEL MEMORY SYSTEMS

Номер: WO2012123061A1
Автор: VORBACH, Martin
Принадлежит:

The invention relates to a multi-core processor memory system, wherein it is provided that the system comprises memory channels between the multi-core processor and the system memory, and that the system comprises at least as many memory channels as processor cores, each memory channel being dedicated to a processor core, and that the memory system dynamically assigns memory blocks at run-time to the accessing core, the accessing core having dedicated access to the memory bank via its memory channel.

Подробнее
03-11-1988 дата публикации

PARALLEL-PROCESSING SYSTEM EMPLOYING A HORIZONTAL ARCHITECTURE COMPRISING MULTIPLE PROCESSING ELEMENTS AND INTERCONNECT CIRCUIT WITH DELAY MEMORY ELEMENTS TO PROVIDE DATA PATHS BETWEEN THE PROCESSING ELEMENTS

Номер: WO1988008568A1
Принадлежит:

A computer system (3) including a processing unit (8) having one or more processors (32-1, 32-2, 32-3), for performing operations on input operands and providing output operands (11), a multiconnect unit (6) for storing operands at addressable locations (34-1, 34-2) and for providing said input operands (10-1) from source addresses and for storing said output operands with destination addresses, an instruction unit (9) for specifying operations to be performed by said processing unit (8), for specifying source address offsets and destination address offsets relative to a modifiable pointer, invariant addressing means (12) for providing said modifiable pointer and for combining said address offsets to form said source addresses and said destination addresses in said multiconnect unit (6).

Подробнее
17-02-2005 дата публикации

Processors and compiling methods for processors

Номер: US20050039167A1
Принадлежит:

A compiling method compiles an object program to be executed by a processor having a plurality of execution units operable in parallel. In the method a first availability chain is created from a producer instruction (p1), scheduled for execution by a first one of the execution units (20: AGU), to a first consumer instruction (c1), scheduled for execution by a second one of the execution units (22: EXU) and requiring a value produced by the said producer instruction. The first availability chain comprises at least one move instruction (mv1-mv3) for moving the required value from a first point (20: ARF) accessible by the first execution unit to a second point (22: DRF) accessible by the second execution unit. When a second consumer instruction (c2), also requiring the same value, is scheduled for execution by an execution unit (23: EXU) other than the first execution unit, at least part of the first availability chain is reused to move the required value to a point (23: DRF) accessible by ...

Подробнее
16-01-2014 дата публикации

Method and System for Automated Improvement of Parallelism in Program Compilation

Номер: US20140019949A1
Принадлежит:

A method of program compilation to improve parallelism during the linking of the program by a compiler. The method includes converting statements of the program to canonical form, constructing an abstract syntax tree (AST) for each procedure in the program, and traversing the program to construct a graph by making each non-control flow statement and each control structure into at least one node of the graph.

Подробнее
31-01-2012 дата публикации

Parallel programming computing system to dynamically allocate program portions

Номер: US0008108845B2

A computing system receives a program created by a technical computing environment, analyzes the program, generates multiple program portions based on the analysis of the program, dynamically allocates the multiple program portions to multiple software units of execution for parallel programming, receives multiple results associated with the multiple program portions from the multiple software units of execution, and provides the multiple results or a single result to the program.

Подробнее
02-04-2019 дата публикации

Alternative loop limits for accessing data in multi-dimensional tensors

Номер: US0010248908B2
Принадлежит: Google LLC, GOOGLE LLC

Methods, systems, and apparatus for accessing a N-dimensional tensor are described. In some implementations, a method includes, for each of one or more first iterations of a first nested loop, performing iterations of a second nested loop that is nested within the first nested loop until a first loop bound for the second nested loop is reached. A number of iterations of the second nested loop for the one or more first iterations of the first nested loop is limited by the first loop bound in response to the second nested loop having a total number of iterations that exceeds a value of a hardware property of the computing system. After a penultimate iteration of the first nested loop has completed, one or more iterations of the second nested loop are performed for a final iteration of the first nested loop until an alternative loop bound is reached.
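A minimal C sketch of the alternative-bound idea with assumed extents: the inner loop normally runs a hardware-limited number of iterations, and on the final outer iteration it runs only the leftover iterations so the total matches the tensor's true extent.

#include <stdio.h>

#define TOTAL        22           /* true number of inner iterations */
#define FIRST_BOUND   8           /* limit imposed by a hardware property (assumed) */
#define OUTER        ((TOTAL + FIRST_BOUND - 1) / FIRST_BOUND)
#define ALT_BOUND    (TOTAL - (OUTER - 1) * FIRST_BOUND)

int main(void) {
    int visited = 0;
    for (int o = 0; o < OUTER; o++) {
        /* the final outer iteration uses the alternative loop bound */
        int bound = (o == OUTER - 1) ? ALT_BOUND : FIRST_BOUND;
        for (int i = 0; i < bound; i++)
            visited++;                      /* access one tensor element */
    }
    printf("%d elements visited\n", visited);   /* prints 22 */
    return 0;
}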

Подробнее
11-07-2002 дата публикации

Predicated execution of instructions in processors

Номер: US2002091996A1
Автор:
Принадлежит:

A processor, operable to execute instructions on a predicated basis, includes a series of predicate registers (135), a control information holding unit (131) and a plurality of operating units (133). Each predicate register of the series (135) is switchable between at least respective first and second states and each is assignable to one or more predicated-execution instructions. The control information holding unit (131) holds items of control information which correspond respectively to the predicate registers, and each operating unit also corresponds individually to one of the predicate registers. Each operating unit has a first control input connected to the control information holding unit (131) for receiving the control-information item corresponding to its unit's own corresponding predicate register and also has a second control input connected for receiving the control-information item corresponding to a further one of the predicate registers. Each operating unit is operable to ...

Подробнее
04-04-2019 дата публикации

METHOD AND APPARATUS FOR MAPPING STATIC SINGLE ASSIGNMENT INSTRUCTIONS ONTO A DATAFLOW GRAPH IN A DATAFLOW ARCHITECTURE

Номер: DE102018214541A1
Принадлежит:

Methods, apparatus, systems, and articles of manufacture for mapping a set of instructions onto a dataflow graph are disclosed herein. An example apparatus includes a variable handler to modify a variable in the set of instructions. The variable is used multiple times in the set of instructions, and the set of instructions is in static single assignment form. The apparatus also includes a PHI handler to replace a PHI instruction contained in the set of instructions with a set of dataflow control instructions, and a dataflow graph generator to map the set of instructions, as modified by the variable handler and the PHI handler, onto a dataflow graph without transforming the instructions out of static single assignment form.

Подробнее
26-01-2012 дата публикации

Parallel loop management

Номер: US20120023316A1
Принадлежит: International Business Machines Corp

The illustrative embodiments comprise a method, data processing system, and computer program product having a processor unit for processing instructions with loops. A processor unit creates a first group of instructions having a first set of loops and second group of instructions having a second set of loops from the instructions. The first set of loops have a different order of parallel processing from the second set of loops. A processor unit processes the first group. The processor unit monitors terminations in the first set of loops during processing of the first group. The processor unit determines whether a number of terminations being monitored in the first set of loops is greater than a selectable number of terminations. In response to a determination that the number of terminations is greater than the selectable number of terminations, the processor unit ceases processing the first group and processes the second group.

Подробнее
11-01-2018 дата публикации

METHODS AND APPARATUS TO ELIMINATE PARTIAL-REDUNDANT VECTOR LOADS

Номер: US20180011693A1
Принадлежит:

Methods, apparatus, systems and articles of manufacture are disclosed to eliminate partial-redundant vector load operations. An example apparatus includes a node grouper to associate a vector operation with a node group, a candidate verifier to perform a dependencies test on a subset of the node group, and identify a subset of the node group as a candidate when the subset satisfies the dependencies test, and a code optimizer to determine replacement code based on a characteristic of the candidate in the node group and compare an estimated cost associated with executing the replacement code to a threshold. The example apparatus also includes a code generator to generate machine code using the replacement code when the estimated cost of executing the replacement code satisfies the threshold. 1. An apparatus to eliminate partial-redundant vector load operations , the apparatus comprising:a node grouper to associate a vector operation with a node group; perform a dependencies test on a subset of the node group; and', 'identify a subset of the node group as a candidate when the subset satisfies the dependencies test;, 'a candidate verifier to determine replacement code based on a characteristic of the candidate in the node group; and', 'compare an estimated cost associated with executing the replacement code to a threshold; and, 'a code optimizer toa code generator to generate machine code using the replacement code when the estimated cost of executing the replacement code satisfies the threshold, at least one of the node grouper, the candidate verifier, the code optimizer and the code generator implemented by hardware.2. An apparatus as defined in claim 1 , further including a vector load identifier to parse the vector operation to identify a load type of the vector operation.3. An apparatus as defined in claim 2 , wherein the node group is a first node group and the node grouper is to create a second node group corresponding to the load type when the subset of the ...

Подробнее
18-01-2018 дата публикации

INFORMATION PROCESSING DEVICE, STORAGE MEDIUM, AND METHOD

Номер: US20180018153A1
Автор: MUKAI Yuta
Принадлежит: FUJITSU LIMITED

A device includes a processor configured to: divide loop in a program into first loop and second loop when compiling the program, the loop accessing data of an array and prefetching data of the array to be accessed at a repetition after prescribed repetitions at each repetition, the first loop including one or more repetitions from an initial repetition to a repetition immediately before the repetition after the prescribed repetitions, the second loop including one or more repetitions from the repetition after the prescribed repetitions to a last repetition, and generate an intermediate language code configured to access data of the array using a first region in a cache memory and prefetch data of the array using a second region in the cache memory in the first loop, and to access and prefetch data of the array using the second region in the second loop. 1. An information processing device comprising:a memory; and divide loop processing in a program into first loop processing and second loop processing when compiling the program, the loop processing accessing data of an array and prefetching data of the array to be accessed at a repetition processing after prescribed repetition processings at each repetition processing in the loop processing, the first loop processing including one or more repetition processings from an initial repetition processing to a repetition processing immediately before the repetition processing after the prescribed repetition processings, the second loop processing including one or more repetition processings from the repetition processing after the prescribed repetition processings to a last repetition processing, and', 'generate an intermediate language code based on the program when compiling the program, the intermediate language code being configured to access data of the array by using a first region in a cache memory and prefetch data of the array by using a second region in the cache memory in the first loop processing, and to ...
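A simplified C sketch of splitting a loop around a prefetch distance (the cache-region bookkeeping described in the abstract is omitted): the first loop both accesses and prefetches, while the second loop covers the tail where there is nothing left to prefetch. The prefetch distance is an assumed value and __builtin_prefetch is a GCC/Clang builtin.

#include <stdio.h>

#define N   1024
#define PF  16   /* prefetch distance in iterations (assumed) */

int main(void) {
    static double a[N];
    double sum = 0.0;
    for (int i = 0; i < N; i++) a[i] = 1.0;

    int split = N - PF;
    for (int i = 0; i < split; i++) {          /* first loop: access + prefetch */
        __builtin_prefetch(&a[i + PF]);        /* GCC/Clang builtin */
        sum += a[i];
    }
    for (int i = split; i < N; i++)            /* second loop: access only */
        sum += a[i];

    printf("%f\n", sum);
    return 0;
}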

Подробнее
24-04-2014 дата публикации

SYSTEMS AND METHODS FOR PARALLELIZATION OF PROGRAM CODE, INTERACTIVE DATA VISUALIZATION, AND GRAPHICALLY-AUGMENTED CODE EDITING

Номер: US20140115560A1
Автор: Hutchison Luke
Принадлежит:

A system for providing a computer configured to read an immutable value for a variable; read the value of the variable at a specific timestamp, thereby providing an ability to create looping constructs; set a current or next value of a loop variable as a function of previous or current loop variable values; read a set of all values that a variable will assume; push or scattering the values into unordered collections; and reduce the collections into a single value. 1. A computer comprising tangible computer readable storage media and a processor , the storage media comprises instructions to cause the processor to:read an immutable value for a variable;read the value of the variable at a specific timestamp, thereby providing an ability to create looping constructs;set a current or next value of a loop variable as a function of previous or current loop variable values;read a set of all values that a variable will assume;push or scatter the values into unordered collections; andreduce the collections into a single value.2. The storage media of claim 1 , wherein the push or scatter instruction comprises pushing values into bins based on a key.3. The storage media of claim 1 , wherein the collections comprise a type claim 1 , and the storage media provides a type system to constrain the type of any collections that are recipients of push operations to be unordered.4. The storage media of claim 3 , wherein any fold or reduce operations applied to those collections requires the collections to be unordered.5. The storage media of claim 3 , comprising a scatter operation configured to directly map into a MapReduce-style computation.6. A computer comprising tangible computer readable storage media and a processor claim 3 , the storage media comprises instructions to cause the processor to: provide an integrated development environment; the environment comprising:a textual editor for a lattice-based programming language; the textual editor configured to show program source code ...

Подробнее
01-02-2018 дата публикации

Processor for Correlation-Based Loop Detection

Номер: US20180032340A1
Принадлежит:

Processor comprising an execution unit and a detection unit which are functionally connected, wherein the execution unit is configured to execute computer programs, and wherein the detection unit is configured to detect infinite loops during the execution of a computer program in the execution unit during run-time, wherein the computer program comprises a plurality of go-to instructions, wherein each go-to instruction is characterized by a corresponding branch address, wherein the detection unit is configured to calculate a detection function of the branch addresses of a branch sequence, the branch sequence comprising a sequence of executed go-to instructions, and wherein the detection function is chosen such that an increased value of the detection function is characteristic of an infinite loop in the branch sequence in which at least one go-to instruction is repeated. 1. A processor comprising an execution unit and a detection unit that are functionally connected , wherein the execution unit is configured to execute computer programs , and wherein the detection unit is configured to detect infinite loops during the execution of a computer program in the execution unit during the run-time of the computer program;wherein the computer program comprises a plurality of go-to instructions,wherein each go-to instruction is characterized by a corresponding branch address;wherein the detection unit is configured to calculate a detection function,wherein the detection function is a function of the branch addresses of a branch sequence, wherein the branch sequence comprises a sequence of executed go-to instructions; andwherein the detection function is selected such that an increased value of the detection function is characteristic of an infinite loop in which at least one go-to instruction is repeated.2. The processor according to claim 1 , which is further configured to compare the detection function with a threshold value in order to detect the existence of an infinite ...

Подробнее
01-02-2018 дата публикации

LOOP VECTORIZATION METHODS AND APPARATUS

Номер: US20180032342A1
Принадлежит:

Loop vectorization methods and apparatus are disclosed. An example method includes generating a first control mask for a set of iterations of a loop by evaluating a condition of the loop, wherein generating the first control mask includes setting a bit of the control mask to a first value when the condition indicates that an operation of the loop is to be executed, and setting the bit of the first control mask to a second value when the condition indicates that the operation of the loop is to be bypassed. The example method also includes compressing indexes corresponding to the first set of iterations of the loop according to the first control mask. 1. (canceled)2. An apparatus comprising:an array populater to populate an array with a first set of data elements based on a control mask, the control mask indicating whether an operation of a loop is to be performed on the first set of data elements;a register loader to load a first subset of data elements from the array into a register, the first subset of data elements corresponding to a number of data elements stored by the register, the array populator to move a second subset of data elements within the array after the first subset of data elements has been loaded into the register; andcomputation circuitry to perform the operation of the loop on the first subset of data elements in the register.3. The apparatus of claim 2 , further including a control mask generator to generate the control mask for a first set of iterations of the loop by:setting a bit of the control mask to a first value when a condition of the loop indicates that an operation of the loop is to be executed; andsetting the bit of the control mask to a second value when the condition indicates that the operation of the loop is to be bypassed.4. The apparatus of claim 2 , wherein the array populater is to move the second subset of data elements after the computation circuitry performs of the operation of the loop on the data elements in the register. ...
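A minimal C sketch of the mask-and-compress idea under assumed data: a control mask records which iterations must execute the guarded operation, the active indexes are compressed into a dense array, and the operation then runs over that array without branches.

#include <stdio.h>

#define N 16

int main(void) {
    int a[N], active[N], m = 0;
    unsigned mask = 0;
    for (int i = 0; i < N; i++) a[i] = i % 3;

    for (int i = 0; i < N; i++)              /* build the control mask */
        if (a[i] != 0) mask |= 1u << i;      /* 1 = execute, 0 = bypass */

    for (int i = 0; i < N; i++)              /* compress active indexes */
        if (mask & (1u << i)) active[m++] = i;

    for (int k = 0; k < m; k++)              /* branch-free guarded work */
        a[active[k]] *= 10;

    for (int i = 0; i < N; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}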

Подробнее
09-02-2017 дата публикации

COMPILING SOURCE CODE TO REDUCE RUN-TIME EXECUTION OF VECTOR ELEMENT REVERSE OPERATIONS

Номер: US20170039048A1
Принадлежит:

Compiling source code to reduce run-time execution of vector element reverse operations, includes: identifying, by a compiler, a first loop nested within a second loop in a computer program; identifying, by the compiler, a vector element reverse operation within the first loop; moving, by the compiler, the vector element reverse operation from the first loop to the second loop. 17-. (canceled)8. An apparatus for compiling source code to reduce run-time execution of vector element reverse operations , the apparatus comprising a computer processor , a computer memory operatively coupled to the computer processor , the computer memory having disposed within it computer program instructions that , when executed by the computer processor , cause the apparatus to carry out the steps of:identifying, by a compiler, a first loop in a computer program;identifying, by the compiler, at least one vector element reverse operation within the first loop;analyzing, by the compiler, a dataflow graph containing that at least one vector element reverse operation within the first loop, including determining whether all vector operations in a portion of the dataflow graph including the first loop are lane-insensitive and determining whether all vector operations in the portion of the dataflow graph containing the first loop are lane-adjustable; andresponsive to the analysis, replacing, by the compiler, the vector element reverse operations from the first loop by vector element reverse operations outside the first loop.9. The apparatus of wherein:identifying at least one vector element reverse operation within the first loop further comprises identifying t least one vector operation within the first loop having a live-in vector value; andreplacing the vector element reverse operations from the first loop by vector element reverse operations outside the first loop further comprises inserting vector element reverse operations at an incoming perimeter of the first loop.10. The apparatus of ...

Подробнее
06-02-2020 дата публикации

OPTIMIZATION OF LOOPS AND DATA FLOW SECTIONS IN MULTI-CORE PROCESSOR ENVIRONMENT

Номер: US20200042492A1
Автор: Vorbach Martin
Принадлежит: HYPERION CORE, INC.

The present invention relates to a method for compiling code for a multi-core processor, comprising: detecting and optimizing a loop, partitioning the loop into partitions executable and mappable on physical hardware with optimal instruction level parallelism, optimizing the loop iterations and/or loop counter for ideal mapping on hardware, chaining the loop partitions generating a list representing the execution sequence of the partitions. 1. A method for operating a processor that comprises a multitude of data processing units , the method comprising:dividing a thread into a plurality of partitions executable on the data processing units, each partition including a plurality of instructions; andchaining the partitions together for transferring data at least from a first partition to a second partition of the plurality of partitions,wherein each of the partitions forms a code entity which is processed as a whole such that data is processed in each instruction of the partition after a preceding instruction of the partition without interruption.2. The method according to claim 1 , wherein the processor is a graphics processor.3. The method according to claim 1 , wherein the data processing units process VLIW. This application is a continuation of U.S. patent application Ser. No. 15/601,946, filed May 22, 2017, which is a continuation of U.S. patent application Ser. No. 14/693,793, filed Apr. 22, 2015 (now U.S. Pat. No. 9,672,188), which is a continuation of U.S. patent application Ser. No. 13/519,887, filed Nov. 6, 2012 (now U.S. Pat. No. 9,043,769), which claims priority as a national phase application of International Patent Application No. PCT/EP2010/007950, filed Dec. 28, 2010, which claims priority to European Patent Application No. EP10007074.7, filed Jul. 9, 2010, European Patent Application No. EP10002086.6, filed Mar. 2, 2010, European Patent Application No. EP10000349.0, filed Jan. 15, 2010, and European Patent Application No. EP09016045.8, filed Dec. 28, ...

Подробнее
06-02-2020 дата публикации

BUFFER OVERFLOW DETECTION BASED ON A SYNTHESIS OF ASSERTIONS FROM TEMPLATES AND K-INDUCTION

Номер: US20200042697A1
Принадлежит: ORACLE INTERNATIONAL CORPORATION

A method for buffer overflow detection involves obtaining a program code configured to access memory locations in a loop using a buffer index variable, obtaining an assertion template configured to capture a dependency between the buffer index variable and a loop index variable of the loop in the program code, generating an assertion using the assertion template, verifying that the assertion holds using a k-induction; and determining whether a buffer overflow exists using the assertion. 1. A method for buffer overflow detection comprising:obtaining a program code configured to access memory locations in a loop using a buffer index variable;obtaining an assertion template configured to capture a dependency between the buffer index variable and a loop index variable of the loop in the program code;generating an assertion using the assertion template;verifying that the assertion holds, using a k-induction; anddetermining whether a buffer overflow exists using the assertion.2. The method of claim 1 , wherein determining whether the buffer overflow exists comprises making a determination claim 1 , using the assertion and a memory allocation specified in the program code claim 1 , that an execution of the program results in the buffer index variable to point to a memory location beyond the memory allocation during at least one execution of the loop.3. The method of claim 1 ,wherein the assertion template and the assertion are for an upper bound of the buffer index variable, andwherein a second assertion template is used to generate a second assertion for a lower bound of the buffer index variable.4. The method of claim 1 , wherein the assertion template establishes a linear relationship between the buffer index variable and the loop index variable.5. The method of claim 1 ,wherein the generated assertion establishes a boundary for the buffer index variable, based on the loop index variable, andwherein the generated assertion, prior to verifying the assertion, is assumed ...
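A small C sketch of the dependency such an assertion template captures, with an assumed linear relation: the buffer index stays equal to twice the loop index, so proving the assertions for every iteration bounds the index and rules out an overflow of the allocation.

#include <assert.h>
#include <stdio.h>

#define N 10

int main(void) {
    int buf[2 * N];
    int b = 0;                          /* buffer index variable */
    for (int i = 0; i < N; i++) {       /* i is the loop index variable */
        assert(b == 2 * i);             /* instantiated assertion template */
        assert(b >= 0 && b + 1 < 2 * N);/* bound implied by the relation */
        buf[b] = i;
        buf[b + 1] = -i;
        b += 2;
    }
    printf("%d %d\n", buf[0], buf[2 * N - 1]);
    return 0;
}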

Подробнее
10-03-2022 дата публикации

Hardware Acceleration Method, Compiler, and Device

Номер: US20220075608A1
Принадлежит:

A hardware acceleration method includes obtaining compilation policy information and a source code, where the compilation policy information indicates that a first code type matches a first processor and a second code type matches a second processor; analyzing a code segment in the source code according to the compilation policy information; determining a first code segment belonging to the first code type or a second code segment belonging to the second code type; compiling the first code segment into a first executable code; sending the first executable code to the first processor; compiling the second code segment into a second executable code; and sending the second executable code to the second processor. 1. A hardware acceleration method comprising:obtaining source code;obtaining, according to the source code, first executable code matching a first processor and running in the first processor;receiving, from the first processor, first execution information for executing the first executable code, wherein the first execution information comprises a first execution parameter of the first executable code in the first processor, and wherein the first execution parameter is an execution duration of the first executable code in the first processor;determining that the source code corresponding to the first executable code matches a second processor when the first execution parameter exceeds a first threshold, wherein the first threshold is based on an estimation of a second execution parameter of the source code in the second processor, and wherein the second execution parameter is an estimated execution duration of the source code in the second processor; andobtaining, according to the source code and when the source code matches the second processor, second executable code matching the second processor.2. The hardware acceleration method of claim 1 , further comprising:unloading the first executable code from the first processor; andsending, to the second ...

Подробнее
05-03-2015 дата публикации

CODE PROFILING OF EXECUTABLE LIBRARY FOR PIPELINE PARALLELIZATION

Номер: US20150067663A1
Принадлежит:

A method and system for creating a library method stub in source code form corresponding to an original library call in machine-executable form. The library method stub is created in a predefined programming language by use of a library method signature associated with the original library call, at least one idiom sentence, and a call invoking the original library call. Creating the library method stub includes composing source code of the library method stub by matching the at least one idiom sentence with idiom-stub mappings predefined for each basic idiom of at least one basic idiom. The original library call appears in sequential code. The library method signature specifies formal arguments of the original library call. The at least one idiom sentence summarizes memory operations performed by the original library call on the formal arguments. The created library method stub is stored in a database. 1. A method for creating a library method stub in source code form corresponding to an original library call in machine-executable form , said method comprising:creating, by a computer processor, the library method stub in a predefined programming language by use of a library method signature associated with the original library call, at least one idiom sentence, and a call invoking the original library call, wherein said creating the library method stub comprises composing source code of the library method stub by matching the at least one idiom sentence with idiom-stub mappings predefined for each basic idiom of at least one basic idiom, wherein the original library call appears in sequential code, wherein the library method signature specifies formal arguments of the original library call, wherein the at least one idiom sentence summarizes memory operations performed by the original library call on the formal arguments, and wherein a sentence S of the at least one basic idiom provides at least one rule for generating a composition of literals to generate a complex ...

Подробнее
16-03-2017 дата публикации

METHOD FOR SECURING A PROGRAM

Номер: US20170075788A1
Автор: Bolignano Dominique
Принадлежит:

A method for securing a first program, the first program including a finite number of program points and evolution rules associated to program points and defining the passage of a program point to another, the method including defining a plurality of exit cases and, when a second program is used in the definition of the first program, for each exit case, definition of a branching toward a specific program point of the first program or a declaration of branching impossibility, defining a set of properties to be proven, each associated with one of the constitutive elements of the first program, said set of properties comprising the branching impossibility as a particular property and establishment of the formal proof of the set of properties. 1. (canceled)2. A method for securing a first program , the first program comprising a finite number of program points and evolution rules associated with the program points and defining the passage from one program point to another program point , the method comprising:defining a plurality of exit cases in a non-transitory computer readable medium and, when a second program is used in the definition of the first program, for each exit case of the second program, defining a branching toward a specific program point of the first program or a declaration of branching impossibility, wherein the branching impossibility comprises a normally possible transition proved impossible;defining a set of local properties to be proven, each associated with one or more of the program points and evolution rules of the first program, said set of local properties comprising the branching impossibility as a particular local property; andestablishing a formal proof of the set of properties absent a concrete execution of either the first program or the second program.3. The method according to claim 2 , wherein the evolution rules claim 2 , the exit cases claim 2 , and the branchings define a tree structure of logic traces claim 2 , and wherein a ...

Подробнее
18-03-2021 дата публикации

SYSTEM AND METHOD FOR COMPILING HIGH-LEVEL LANGUAGE CODE INTO A SCRIPT EXECUTABLE ON A BLOCKCHAIN PLATFORM

Номер: US20210081185A1
Принадлежит:

A computer-implemented method (and corresponding system) is provided that enables or facilitates the execution of a portion of source code, written in a high-level language (HLL), on a blockchain platform. The method and system can include a blockchain compiler, arranged to convert a portion of high-level source code into a form that can be used with a blockchain platform. This may be the Bitcoin blockchain or an alternative. The method can include: receiving the portion of source code as input; and generating an output script comprising a plurality of op codes. The op codes are a subset of op codes that are native to a functionally-restricted, blockchain scripting language. The outputted script is arranged and/or generated such that, when executed, the script provides, at least in part, the functionality specified in the source code. The blockchain scripting language is restricted such that it does not natively support complex control-flow constructs or recursion via jump-based loops or other recursive programming constructs. The step of generating the output script may comprise the unrolling at least one looping construct provided in the source code. The method may further comprise providing or using an interpreter or virtual machine arranged to convert the output script into a form that is executable on a blockchain platform. 1. A computer-implemented method comprising the steps:receiving a portion of source code as input, wherein the portion of source code is written in a high-level language (HLL); andgenerating an output script comprising a plurality of op_codes selected from and/or native to a functionally-restricted, blockchain scripting language such that, when executed, the script provides, at least in part, the functionality specified in the portion of source code.2. A method according to and comprising the step of: providing or using a compiler arranged to perform the steps of .3. A method according to wherein the output script is generated by performing ...

Подробнее
31-03-2016 дата публикации

Method and Apparatus for Approximating Detection of Overlaps Between Memory Ranges

Номер: US20160092285A1
Принадлежит: Intel Corp

A computer-implemented method for managing loop code in a compiler includes using a conflict detection procedure that detects across-iteration dependency for arrays of single memory addresses to determine whether a potential across-iteration dependency exists for arrays of memory addresses for ranges of memory accessed by the loop code.
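A minimal C sketch of a range-based conflict test of this kind: rather than comparing individual addresses, the loop's written range and read range are compared as half-open intervals, which overlap exactly when each one starts before the other ends. The ranges used in main are illustrative.

#include <stdbool.h>
#include <stdio.h>

/* Half-open ranges [lo, hi) overlap exactly when lo1 < hi2 && lo2 < hi1. */
static bool ranges_overlap(const void *lo1, const void *hi1,
                           const void *lo2, const void *hi2) {
    return (const char *)lo1 < (const char *)hi2 &&
           (const char *)lo2 < (const char *)hi1;
}

int main(void) {
    double a[100];
    /* write range a[0..49], read range a[50..99]: no conflict */
    printf("%d\n", ranges_overlap(&a[0], &a[50], &a[50], &a[100]));
    /* write range a[0..59], read range a[50..99]: conflict */
    printf("%d\n", ranges_overlap(&a[0], &a[60], &a[50], &a[100]));
    return 0;
}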

Подробнее
30-03-2017 дата публикации

Low-Layer Memory for a Computing Platform

Номер: US20170091094A1

The present disclosure relates to low-layer memory for a computing platform. An example embodiment includes a memory hierarchy being directly connectable to a processor. The memory hierarchy includes at least a level 1, referred to as L1, memory structure comprising a non-volatile memory unit as L1 data memory and a buffer structure (L1-VWB). The buffer structure includes a plurality of interconnected wide registers with an asymmetric organization, wider towards the non-volatile memory unit than towards a data path connectable to the processor. The buffer structure and the non-volatile memory unit are arranged for being directly connectable to a processor so that data words can be read directly from either of the L1 data memory and the buffer structure (L1-VWB) by the processor.

Подробнее
19-03-2020 дата публикации

Hardware Acceleration Method, Compiler, and Device

Номер: US20200089480A1
Принадлежит:

A hardware acceleration method includes: obtaining compilation policy information and a source code, where the compilation policy information indicates that a first code type matches a first processor and a second code type matches a second processor, analyzing a code segment in the source code according to the compilation policy information, determining a first code segment belonging to the first code type or a second code segment belonging to the second code type, compiling the first code segment into a first executable code, sending the first executable code to the first processor, compiling the second code segment into a second executable code, and sending the second executable code to the second processor. 1. A hardware acceleration method implemented by a compiler , the hardware acceleration method comprising:obtaining compilation policy information and source code, wherein the compilation policy information indicates that a first code type matches a first processor and a second code type matches a second processor;determining a first code segment belonging to the first code type and a second code segment belonging to the second code type according to the compilation policy information;compiling the first code segment into first executable code;sending the first executable code to the first processor;compiling the second code segment into second executable code; andsending the second executable code to the second processor, stopping the third executable code when a busy degree of the second processor is higher than a first preset threshold;', 'compiling a third code segment corresponding to the third executable code into a fourth executable code matching the first processor; and', 'sending the fourth executable code to the first processor., 'wherein when a priority of a first process corresponding to the second code segment is higher than a priority of a second process corresponding to a third executable code being executed in the second processor, before ...

Подробнее
07-04-2016 дата публикации

Method and system for automated improvement of parallelism in program compilation

Номер: US20160098258A1
Принадлежит: Individual

A method of program compilation to improve parallelism during the linking of the program by a compiler. The method includes converting statements of the program to canonical form, constructing a traversable representation, such as an abstract syntax tree (AST), for each procedure in the program, and traversing the program to construct a graph by making each non-control flow statement and each control structure into at least one node of the graph.

Подробнее
04-04-2019 дата публикации

Processors, methods, and systems with a configurable spatial accelerator having a sequencer dataflow operator

Номер: US20190102338A1
Принадлежит: Intel Corp

Systems, methods, and apparatuses relating to a sequencer dataflow operator of a configurable spatial accelerator are described. In one embodiment, an interconnect network between a plurality of processing elements receives an input of a dataflow graph comprising a plurality of nodes forming a loop construct, wherein the dataflow graph is overlaid into the interconnect network and the plurality of processing elements with each node represented as a dataflow operator in the plurality of processing elements and at least one dataflow operator controlled by a sequencer dataflow operator of the plurality of processing elements, and the plurality of processing elements is to perform an operation when an incoming operand set arrives at the plurality of processing elements and the sequencer dataflow operator generates control signals for the at least one dataflow operator in the plurality of processing elements.

Подробнее
27-04-2017 дата публикации

CODE PROFILING OF EXECUTABLE LIBRARY FOR PIPELINE PARALLELIZATION

Номер: US20170115974A1
Принадлежит:

A method and system. A library method stub is created in a predefined programming language by use of a library method signature associated with an original library call, at least one idiom sentence, and a call invoking the original library call. Creating the library method stub includes composing source code of the library method stub by matching the at least one idiom sentence with idiom-stub mappings predefined for each basic idiom of at least one basic idiom. The original library call appears in sequential code. The library method signature specifies formal arguments of the original library call. The at least one idiom sentence summarizes memory operations performed by the original library call on the formal arguments. The created library method stub is stored in a database. 1. A method for creating a library method stub in source code form corresponding to an original library call in machine-executable form , said method comprising:creating, by a computer processor, a library method stub in a predefined programming language by use of a library method signature associated with an original library call, at least one idiom sentence, and a call invoking the original library call, wherein said creating the library method stub comprises composing source code of the library method stub by matching the at least one idiom sentence with idiom-stub mappings predefined for each basic idiom of at least one basic idiom, wherein the original library call appears in sequential code, wherein the library method signature specifies formal arguments of the original library call, and wherein the at least one idiom sentence summarizes memory operations performed by the original library call on the formal arguments; andsaid processor storing the created library method stub in a database.2. The method of claim 1 , wherein a sentence S of the at least one basic idiom provides at least one rule for generating a composition of literals to generate a complex idiom.3. The method of claim 2 ...

Подробнее
18-04-2019 дата публикации

Method and system for converting a single-threaded software program into an application-specific supercomputer

Номер: US20190114158A1
Принадлежит: Global Supercomputing Corporation

The invention comprises (i) a compilation method for automatically converting a single-threaded software program into an application-specific supercomputer, and (ii) the supercomputer system structure generated as a result of applying this method. The compilation method comprises: (a) Converting an arbitrary code fragment from the application into customized hardware whose execution is functionally equivalent to the software execution of the code fragment; and (b) Generating interfaces on the hardware and software parts of the application, which (i) Perform a software-to-hardware program state transfer at the entries of the code fragment; (ii) Perform a hardware-to-software program state transfer at the exits of the code fragment; and (iii) Maintain memory coherence between the software and hardware memories. If the resulting hardware design is large, it is divided into partitions such that each partition can fit into a single chip. Then, a single union chip is created which can realize any of the partitions. 2. The directory-mapped coherent memory hierarchy of with the write-update protocol claim 1 , further comprising:a set of line states comprising an exclusive state, a shared state, an invalid state but not including a modified state. This application claims priority, as a continuation application, to U.S. patent application Ser. No. 15/257,319 filed on Sep. 6, 2016, which claims priority, as a continuation application, to U.S. patent application Ser. No. 14/581,169 filed on Dec. 23, 2014, now U.S. Pat. No. 9,495,223, which claims priority, as a continuation application, to U.S. patent application Ser. No. 13/296,232 filed on Nov. 15, 2011, now U.S. Pat. No. 8,966,457. Ser. Nos. 15/257,319, 14/581,169, 13/296,232, U.S. Pat. Nos. 9,495,223 and 8,966,457 are hereby incorporated by reference.The invention relates to the conversion of a single-threaded software program into an application-specific supercomputer.It is much more difficult to write parallel ...

Подробнее
03-05-2018 дата публикации

Hardware acceleration method, compiler, and device

Номер: US20180121180A1
Принадлежит: Huawei Technologies Co Ltd

A hardware acceleration method, a compiler, and a device, to improve code execution efficiency and implement hardware acceleration. The method includes: obtaining, by a compiler, compilation policy information and source code, where the compilation policy information indicates that a first code type matches a first processor and a second code type matches a second processor; analyzing, by the compiler, a code segment in the source code according to the compilation policy information, and determining a first code segment belonging to the first code type or a second code segment belonging to the second code type; and compiling, by the compiler, the first code segment into first executable code, and sending the first executable code to the first processor; and compiling the second code segment into second executable code, and sending the second executable code to the second processor.

Подробнее
14-05-2015 дата публикации

Information processing apparatus and compilation method

Номер: US20150135171A1
Принадлежит: Fujitsu Ltd

A storage unit stores source code including loop processing that is written with an array referenced by an index, a loop variable, and a parameter. A computing unit generates a conditional expression indicating that the index of the array satisfies a predetermined condition, using the loop variable and the parameter. The computing unit generates determination information on the parameter, by eliminating the loop variable from the conditional expression through formula manipulation. Then, the computing unit generates object code corresponding to the source code in accordance with the determination information.

Подробнее
01-09-2022 дата публикации

Using hardware-accelerated instructions

Номер: US20220276865A1
Принадлежит: ROBERT BOSCH GMBH

A computer-implemented method of implementing a computation using a hardware-accelerated instruction of a processor system by solving a constraint satisfaction problem. A solution to the constraint satisfaction problem represents a possible invocation of the hardware-accelerated instruction in the computation. The constraint satisfaction problem assigns nodes of a data flow graph of the computation to nodes of a data flow graph of the instruction. The constraint satisfaction problem comprises constraints enforcing that the assigned nodes of the computation data flow graph have equivalent data flow to the instruction data flow graph, and constraints restricting which nodes of the computation data flow graph can be assigned to the inputs of the hardware-accelerated instruction, with restrictions being imposed by the hardware-accelerated instruction and/or its programming interface.

Подробнее
19-05-2016 дата публикации

SYSTEMS, METHODS, AND COMPUTER PROGRAMS FOR PERFORMING RUNTIME AUTO PARALLELIZATION OF APPLICATION CODE

Номер: US20160139901A1
Принадлежит:

Systems, methods, and computer programs are disclosed for performing runtime auto-parallelization of application code. One embodiment of such a method comprises receiving application code to be executed in a multi-processor system. The application code comprises an injected code cost computation expression for at least one loop in the application code defining a serial workload for processing the loop. A runtime profitability check of the loop is performed based on the injected code cost computation expression to determine whether the serial workload can be profitably parallelized. If the serial workload can be profitably parallelized, the loop is executed in parallel using two or more processors in the multi-processor system. 1. A method for performing runtime auto-parallelization of application code , the method comprising:receiving application code to be executed in a multi-processor system, the application code comprising an injected code cost computation expression for at least one loop in the application code defining a serial workload for processing the loop;performing a runtime profitability check of the loop based on the injected code cost computation expression to determine whether the serial workload can be profitably parallelized; andif the serial workload can be profitably parallelized, executing the loop in parallel using two or more processors in the multi-processor system.2. The method of claim 1 , wherein the performing the runtime profitability check comprises:computing a parallelized workload based on an available number of processors; anddetermining whether a sum of the parallelized workload and a parallelization overhead parameter exceeds the serial workload.3. The method of claim 1 , wherein the injected code cost computation expression defines a first static portion of the serial workload defined at compile time and a second dynamic portion of the serial workload to be computed at runtime.4. The method of claim 3 , wherein the performing the ...
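A hedged C sketch of an injected profitability check with assumed cost constants: the serial-workload expression combines a static per-iteration cost with the runtime trip count and is compared against the parallelized workload plus a parallelization overhead.

#include <stdio.h>

#define COST_PER_ITER   5      /* static portion, known at compile time  */
#define PAR_OVERHEAD 2000      /* thread start-up / scheduling (assumed) */
#define NUM_PROCS       4

/* Returns nonzero when the loop's serial workload can be profitably
 * parallelized over NUM_PROCS processors. */
static int profitable(long trip_count) {
    long serial_work   = COST_PER_ITER * trip_count;          /* injected expression */
    long parallel_work = serial_work / NUM_PROCS + PAR_OVERHEAD;
    return parallel_work < serial_work;
}

int main(void) {
    printf("n=100:    %s\n", profitable(100)    ? "parallel" : "serial");
    printf("n=100000: %s\n", profitable(100000) ? "parallel" : "serial");
    return 0;
}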

Подробнее
26-05-2016 дата публикации

SYSTEMS AND METHODS FOR STENCIL AMPLIFICATION

Номер: US20160147514A1
Принадлежит:

In a sequence of major computational steps or in an iterative computation, a stencil amplifier can increase the number of data elements accessed from one or more data structures in a single major step or iteration, thereby decreasing the total number of computations and/or communication operations in the overall sequence or the iterative computation. Stencil amplification, which can be optimized according to a specified parameter such as compile time, rune time, code size, etc., can improve the performance of a computing system executing the sequence or the iterative computation in terms of run time, memory load, energy consumption, etc. The stencil amplifier typically determines boundaries, to avoid erroneously accessing data elements not present in the one or more data structures. 1. A method for improving processing efficiency , the method comprising performing by a processor the steps of:in a computation sequence comprising a plurality of sequence steps, identifying a computation using a stencil comprising a set of stencil points, each stencil point corresponding to a value of a respective element of a data structure in a current sequence step; andmodifying the computation by replacing a stencil point with a first-level substencil comprising a set of first-level substencil points, each first-level substencil point corresponding to a value of a respective element of the data structure from a first previous sequence step, at least one first-level substencil point being associated with a data-structure element that is different from data-structure elements associated with all stencil points.2. The method of claim 1 , wherein modifying the computation comprises generating a loop nest corresponding to at least one stencil point claim 1 , the loop nest comprising a loop corresponding to a parameterized dimension of the data structure claim 1 , the loop comprising a statement accessing an element of the data structure in the parameterized dimension according to a ...

Подробнее
31-05-2018 дата публикации

CROSS-MACHINE BUILD SCHEDULING SYSTEM

Номер: US20180150286A1
Принадлежит: Microsoft Technology Licensing, LLC

Cross-machine build scheduling of a codebase is provided. Systems, methods and computer-readable devices provide for breaking a monolithic codebase into a plurality of tenants. A file containing entries associated with one of the tenants is read, and a selected entry in the file is examined to determine if the entry is requesting the execution of parallel loop. If so, each loop of the parallel loops is executed in parallel, and the selected entry in the file is examined to determine if the entry is an independent loop. If so, the independent loop is executed, and the selected entry in the file is examined to determine if the entry is a dependent loop. If so, execution of the dependent loop is held. 1. A system comprising a computing device , the computing device comprising:a processing device; and breaking a monolithic codebase into a plurality of tenants;', 'reading a file containing entries associated with one of the tenants;', 'examining a selected entry in the file to determine if the entry is requesting the execution of parallel loops, and if so, executing each loop of the parallel loops in parallel;', 'examining the selected entry in the file to determine if the entry is an independent loop, and if so, executing the independent loop; and', 'examining the selected entry in the file to determine if the entry is a dependent loop, and if so, holding execution of the dependent loop., 'a computer readable data storage device storing instructions that, when executed by the processing device are operative to provide2. The system of further comprising claim 1 , releasing for execution the dependent loop once its pre-validation loop has successfully completed execution.3. The system of claim 1 , wherein the loop comprises a request to build the tenant.4. The system of claim 1 , wherein the loop comprises a request to debug the tenant.5. The system of claim 1 , wherein the loop comprises a request to test the tenant.6. The system of claim 2 , wherein the dependent loop ...

Подробнее
07-06-2018 дата публикации

SYSTEMS AND METHODS FOR GENERATING CODE FOR PARALLEL PROCESSING UNITS

Номер: US20180157471A1
Принадлежит:

Systems and methods generate code from a source program where the generated code may be compiled and executed on a Graphics Processing Unit (GPU). A parallel loop analysis check may be performed on regions of the source program identified for parallelization. One or more optimizations also may be applied to the source program that convert mathematical operations into a parallel form. The source program may be partitioned into segments for execution on a host and a device. Kernels may be created for the segments to be executed on the device. The size of the kernels may be determined, and memory transfers between the host and device may be optimized. 1. A method comprising:for a source program having a format for sequential execution, generating one or more in-memory intermediate representations (IRs) for the source program;', 'partitioning, by the processor, the one or more in-memory IRs for the source program into serial code segments and parallel code segments identified as suitable for parallel execution, wherein at least one of the parallel code segments includes a nested loop structure of for-loops, the for-loops of the nested structure including loop bounds;', 'determining, by the processor, a number of thread blocks and a number of threads per thread block for executing the at least one of the parallel code segments, the determining based on an analysis of the at least one of the parallel code segments; and', identifying sets of the for-loops of the nested loop structure that satisfy a parallel loop analysis check and that are contiguous within the nested loop structure;', 'identifying the set from the sets of the for-loops using a criteria;', 'converting the set of the for-loops whose product of the loop bounds is largest to the kernel for parallel execution by the parallel processing device; and', 'adding a kernel launch directive that includes the number of thread blocks and the number of threads per block,, 'converting, by the processor, the at least one ...
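A small C++ sketch of one plausible way the launch configuration could be derived from the flattened iteration space of the selected for-loop nest; the fixed block size and the helper name are assumptions, not taken from the entry.

#include <cstdint>
#include <utility>

// For a set of contiguous parallel for-loops with bounds n1 x n2 x ... the
// total iteration space is their product; a common way to size the launch is
// a fixed number of threads per block and enough blocks to cover the space
// (illustrative values).
std::pair<std::uint64_t, std::uint32_t>
choose_launch(std::uint64_t total_iterations, std::uint32_t threads_per_block = 256) {
    std::uint64_t blocks =
        (total_iterations + threads_per_block - 1) / threads_per_block;  // ceiling division
    return {blocks, threads_per_block};
}

// Example: a nest of for-loops with bounds 480 and 640 flattened into one
// kernel; the kernel launch directive would then carry these two numbers.
// auto [blocks, tpb] = choose_launch(480ull * 640ull);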

Подробнее
18-06-2015 дата публикации

GENERAL PURPOSE SOFTWARE PARALLEL TASK ENGINE

Номер: US20150169305A1
Принадлежит:

A software engine for decomposing work to be done into tasks, and distributing the tasks to multiple, independent CPUs for execution is described. The engine utilizes dynamic code generation, with run-time specialization of variables, to achieve high performance. Problems are decomposed according to methods that enhance parallel CPU operation, and provide better opportunities for specialization and optimization of dynamically generated code. A specific application of this engine, a software three dimensional (3D) graphical image renderer, is described. 1. In a computer system , a parallel task engine for performing tasks on data , the parallel task engine comprising:an input for receiving tasks, each task for performing an operation;a scheduler for decomposing the tasks into one or more new tasks, the decomposing being dependent on at least one policy selected from a given set of policies;a run-time dynamic code generator for generating, for the new tasks, operation routines, the run-time dynamic code generator comprising a dynamic compiler, the dynamic compiler being adapted to output the operation routines for execution;a set of job loops, at least one of the job loops for performing the new tasks on at least part of the data by executing the operation routines, the job loops running in parallel on two or more CPUs;the scheduler for distributing and assigning the new tasks to the at least one of the job loops; andthe scheduler for making the selection of the at least one policy based on general heuristics.2. The parallel task engine of claim 1 , wherein the given set of policies include one or more of: by-domain policies claim 1 , and by-component policies.3. The parallel task engine of claim 2 , wherein the scheduler performs by-domain decomposition on a given task by modifying data pointers or parameters of the given task.4. The parallel task engine of claim 2 , wherein the scheduler performs by-component decomposition on a given task having a full datum ...

Подробнее
28-06-2018 дата публикации

Parallel program generating method and parallelization compiling apparatus

Номер: US20180181380A1
Принадлежит: WASEDA UNIVERSITY

There is provided a parallel program generating method capable of generating a static scheduling enabled parallel program without undermining the possibility of extracting parallelism. The parallel program generating method executed by the parallelization compiling apparatus 100 includes a fusion step (FIG. 2 /STEP 026 ) of fusing, as a new task, a task group including a reference task as a task having a conditional branch, and subsequent tasks as tasks control dependent, extended-control dependent, or indirect control dependent on respective of all branch directions of the conditional branch included in the reference task.

Подробнее
04-06-2020 дата публикации

Method for controlling the flow execution of a generated script of a blockchain transaction

Номер: US20200174762A1
Принадлежит: nChain Holdings Ltd

The invention provides a computer-implemented method (and corresponding system) for generating a blockchain transaction (Tx). This may be a transaction for the Bitcoin blockchain or another blockchain protocol. The method comprises the step of using a software resource to receive, generate or otherwise derive at least one data item; and then insert, at least once, at least one portion of code into a script associated with the transaction. Upon execution of the script, the portion of code provides the functionality of a control flow mechanism, the behaviour of the control flow mechanism being controlled or influenced by the at least one data item. In one embodiment, the code is copied/inserted into the script more than once. The control flow mechanism can be a loop, such as a while or for loop, or a selection control mechanism such as a switch statement. Thus, the invention allows the generation of a more complex blockchain script and controls how the script will execute when implemented on the blockchain. This, in turn, provides control over how or when the output of the blockchain transaction is unlocked.

Подробнее
29-07-2021 дата публикации

SYSTEMS AND METHODS FOR SCALABLE HIERARCHICAL POLYHEDRAL COMPILATION

Номер: US20210232379A1
Принадлежит:

A system for compiling programs for execution thereof using a hierarchical processing system having two or more levels of memory hierarchy can perform memory-level-specific optimizations, without exceeding a specified maximum compilation time. To this end, the compiler system employs a polyhedral model and limits the dimensions of a polyhedral program representation that is processed by the compiler at each level using a focalization operator that temporarily reduces one or more dimensions of the polyhedral representation. Semantic correctness is provided via a defocalization operator that can restore all polyhedral dimensions that had been temporarily removed. 1. A method for optimizing execution of a program by a processing system comprising a hierarchical memory having a plurality of memory levels , the method comprising: removing an iterator corresponding to a loop index associated with the loop dimension being focalized;', (i) removing from a loop condition at another loop dimension of the selected loop nest a subcondition corresponding to the loop index; and', '(ii) removing from a memory access expression of an operand, a reference to the loop index; and, 'at least one of, 'storing the loop index and associated focalization information for that memory level; and, '(a) for each memory level in at least a subset of memory levels in the plurality of memory levels focalizing a loop dimension of a selected loop nest within a program, focalizing comprising adding an iterator based on a reintroduced loop index associated with the loop dimension being defocalized; and', updating a loop condition at another loop dimension of the selected loop nest based on the stored focalization information associated with the loop dimension being defocalized; and', 'updating the memory access expression of the operand based on the reintroduced loop index and the stored focalization information., 'at least one of], '(b) for each focalized dimension, defocalizing that dimension by2. ...

Подробнее
04-07-2019 дата публикации

ALTERNATIVE LOOP LIMITS FOR ACCESSING DATA IN MULTI-DIMENSIONAL TENSORS

Номер: US20190205756A1
Принадлежит:

Methods, systems, and apparatus for accessing a N-dimensional tensor are described. In some implementations, a method includes, for each of one or more first iterations of a first nested loop, performing iterations of a second nested loop that is nested within the first nested loop until a first loop bound for the second nested loop is reached. A number of iterations of the second nested loop for the one or more first iterations of the first nested loop is limited by the first loop bound in response to the second nested loop having a total number of iterations that exceeds a value of a hardware property of the computing system. After a penultimate iteration of the first nested loop has completed, one or more iterations of the second nested loop are performed for a final iteration of the first nested loop until an alternative loop bound is reached. 1. A method for performing computations based on tensor elements of an N-dimensional tensor , comprising: generating a first loop for controlling a number of iterations of a second loop used to traverse the particular dimension;', 'determining a first loop bound for the second loop and an alternative loop bound for the second loop based on the number of tensor elements of the particular dimension and the number of individual computing units of the computing system, wherein the first loop bound controls a number of iterations of the second loop for one or more first iterations of the first loop and the alternative loop bound controls the number of iterations of the second loop for a final iteration of the first loop such that the number of iterations of the second loop does not exceed a number of tensor elements that will be used to perform the computations; and', 'generating code that has the second loop nested within the first loop;, 'determining that a number of tensor elements of a particular dimension of the N-dimensional tensor is not an exact multiple of a number of individual computing units of the computing system ...
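A worked C++ example of the two loop bounds, assuming illustrative numbers (a dimension of 10 tensor elements and 4 computing units, neither taken from the entry): the first loop bound of 4 governs all but the last outer iteration, and the alternative bound of 2 governs the final iteration so no out-of-range element is touched.

#include <cstdio>

int main() {
    const int dim = 10;          // tensor elements along the traversed dimension
    const int units = 4;         // individual computing units
    const int outer = (dim + units - 1) / units;      // 3 outer iterations
    const int first_bound = units;                    // 4
    const int alt_bound = dim - (outer - 1) * units;  // 2, for the final iteration

    for (int i = 0; i < outer; ++i) {
        const int bound = (i == outer - 1) ? alt_bound : first_bound;
        for (int j = 0; j < bound; ++j) {
            int element = i * units + j;
            std::printf("process element %d\n", element);
        }
    }
}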

Подробнее
12-08-2021 дата публикации

SYSTEMS AND METHODS FOR OPTIMIZING NESTED LOOP INSTRUCTIONS IN PIPELINE PROCESSING STAGES WITHIN A MACHINE PERCEPTION AND DENSE ALGORITHM INTEGRATED CIRCUIT

Номер: US20210247981A1
Принадлежит:

In one embodiment, a method for improving a performance of an integrated circuit includes implementing one or more computing devices executing a compiler program that: (i) evaluates a target instruction set intended for execution by an integrated circuit; (ii) identifies one or more nested loop instructions within the target instruction set based on the evaluation; (iii) evaluates whether a most inner loop body within the one or more nested loop instructions comprises a candidate inner loop body that requires a loop optimization that mitigates an operational penalty to the integrated circuit based on one or more executional properties of the most inner loop instruction; and (iv) implements the loop optimization that modifies the target instruction set to include loop optimization instructions to control, at runtime, an execution and a termination of the most inner loop body thereby mitigating the operational penalty to the integrated circuit. 1. A method for improving a performance of an integrated circuit , the method comprising: (i) evaluates a target instruction set;', '(ii) identifies one or more loop instructions within the target instruction set based on the evaluation;', '(iii) evaluates whether an inner loop body within the one or more loop instructions comprises a candidate inner loop body requiring a loop optimization that mitigates an operational penalty to an integrated circuit based on one or more executional properties of the inner loop instruction; and', '(iv) implements the loop optimization that modifies the target instruction set to include loop optimization instructions to control an execution and a termination of the inner loop body thereby mitigating the operational penalty., 'implementing one or more computing devices executing a compiler program that2. A system for improving a performance of an integrated circuit , the system comprising: (i) evaluates a target instruction set;', '(ii) identifies one or more loop instructions within the target ...

Подробнее
20-08-2015 дата публикации

Execution control method and information processing apparatus

Номер: US20150234641A1
Автор: Yoshie Inada
Принадлежит: Fujitsu Ltd

While a first code, in an object code generated from a source code, for a loop included in the source code or a second code in the object code is executed, a feature amount concerning the number of times that a condition of a conditional branch is true is obtained. The loop includes the conditional branch, and the conditional branch is coded in the first code. The second code is a code to perform computation of a branch destination for a case where the condition of the conditional branch is true, only for loop indices that were extracted as the aforementioned case. Then, a processor executes, based on the feature amount, the second code or a third code included in the object code. The third code is a code to write, by using a predicated instruction and into a memory, any computation result of computations of branch destinations.
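A hedged C++ sketch of the two code shapes the entry contrasts; the loop body, the "feature amount" threshold and all names are illustrative.

#include <cstddef>
#include <vector>

// "Second code" shape: collect the loop indices for which the condition holds,
// then run the branch-destination computation only for those indices.
void sparse_form(const std::vector<int>& cond, std::vector<double>& out) {
    std::vector<std::size_t> idx;
    for (std::size_t i = 0; i < cond.size(); ++i)
        if (cond[i] != 0) idx.push_back(i);
    for (std::size_t k : idx)
        out[k] = out[k] * 2.0 + 1.0;   // stand-in for the "true" branch work
}

// "Third code" shape: compute unconditionally and keep the result under a
// predicate, the scalar analogue of a predicated (masked) store.
void predicated_form(const std::vector<int>& cond, std::vector<double>& out) {
    for (std::size_t i = 0; i < cond.size(); ++i) {
        double t = out[i] * 2.0 + 1.0;
        out[i] = cond[i] ? t : out[i];
    }
}

// A runtime dispatcher might pick between the two based on how often the
// condition was observed to be true (the "feature amount"); the 10% threshold
// here is purely illustrative.
void run(const std::vector<int>& cond, std::vector<double>& out, double true_ratio) {
    if (true_ratio < 0.10) sparse_form(cond, out);
    else                   predicated_form(cond, out);
}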

Подробнее
30-10-2014 дата публикации

Semi-Automatic Restructuring of Offloadable Tasks for Accelerators

Номер: US20140325495A1
Принадлежит: NEC Laboratories America, Inc.

A computer implemented method entails identifying code regions in an application from which offloadable tasks can be generated by a compiler for heterogenous computing system with processor and accelerator memory, including adding relaxed semantics to a directive based language in the heterogenous computing for allowing a suggesting rather than specifying a parallel code region as an offloadable candidate, and identifying one or more offloadable tasks in a neighborhood of code region marked by the directive. 1. A computer implemented method comprising:identifying code regions in an application from which offloadable tasks can be generated by a compiler for heterogenous computing system with processor and accelerator memory, comprising:adding relaxed semantics to a directive based language in the heterogenous computing for allowing a suggesting rather than specifying a parallel code region as an offloadable candidate; andidentifying one or more offloadable tasks in a neighborhood of code region marked by the directive.2. The method of claim 1 , wherein the offloadable candidate is a sub-offload.3. The method of claim 2 , wherein the sub-offload comprises only part of the code region marked by the directive is offloaded to the accelerator memory while the other part of the code region executes on the processor in parallel.4. The method of claim 2 , wherein the sub-offload comprises splitting an index range of a main parallel loop into two or more parts and declaring one of the subloops as the offloadable task.5. The method of claim 2 , wherein the sub-offload comprises handling reduction variables and critical sections across subloops without additional synchronization.6. The method of claim 2 , wherein the sub-offload comprises enabling concurrent execution of a task on the processor and accelerator memory.7. The method of claim 1 , wherein the offloadable candidate is a super-offload.8. The method of claim 7 , wherein the super-offload comprises declaring a code ...
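A minimal C++ sketch of the sub-offload idea, with a second host thread standing in for the accelerator and the reduction variable handled as two partial sums so no extra synchronization is needed; the 50/50 split and all names are assumptions.

#include <cstddef>
#include <thread>
#include <vector>

// Sub-offload sketch: the index range [0, n) of the main parallel loop is
// split in two; one subloop stands in for the part offloaded to the
// accelerator, the other runs on the host concurrently, and the reduction is
// combined only after both parts finish.
double sum_of_squares(const std::vector<double>& x) {
    std::size_t n = x.size(), mid = n / 2;    // 50/50 split chosen arbitrarily
    double host_part = 0.0, offload_part = 0.0;

    std::thread offload([&] {                 // placeholder for the accelerator
        for (std::size_t i = mid; i < n; ++i) offload_part += x[i] * x[i];
    });
    for (std::size_t i = 0; i < mid; ++i) host_part += x[i] * x[i];
    offload.join();

    return host_part + offload_part;          // combine the partial reductions
}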

Подробнее
16-08-2018 дата публикации

OPTIMIZE CONTROL-FLOW CONVERGENCE ON SIMD ENGINE USING DIVERGENCE DEPTH

Номер: US20180232239A1
Принадлежит:

There are provided a system, a method and a computer program product for selecting an active data stream (a lane) while running SPMD (Single Program Multiple Data) code on SIMD (Single Instruction Multiple Data) machine. The machine runs an instruction stream over input data streams. The machine increments lane depth counters of all active lanes upon the thread-PC reaching a branch operation. The machine updates the lane-PC of each active lane according to targets of the branch operation. The machine selects an active lane and activates only lanes whose lane-PCs match the thread-PC. The machine decrements the lane depth counters of the selected active lanes and updates the lane-PC of each active lane upon the instruction stream reaching a first instruction. The machine assigns the lane-PC of a lane with a largest lane depth counter value to the thread-PC and activates all lanes whose lane-PCs match the thread-PC. 1. A method for selecting an active data stream while running a SPMD (Single Program Multiple Data) program of instructions on a SIMD (Single Instruction Multiple Data) machine , an instruction stream having one thread-PC (Program Counter) , a value of the thread-PC indicating an instruction memory address which stores an instruction to be fetched next for the instruction stream , comprising;running the instruction stream over a plurality of input data streams each of which corresponds to each of a plurality of lanes, each of the plurality of lanes being associated with a corresponding lane depth counter indicating a number of nested branches for each corresponding lane, a corresponding lane-PC of a lane of the plurality of lanes indicating a memory address which stores the instruction to be fetched next for the lane when the lane is activated, and a lane activation bit indicating whether the lane is active or not;incrementing values of the lane depth counters of all active_lanes of the plurality of lanes upon the thread-PC value reaching a branch operation ...

Подробнее
17-09-2015 дата публикации

Interleaving data accesses issued in response to vector access instructions

Номер: US20150261512A1
Автор: Alastair David Reid
Принадлежит: ARM LTD

A vector data access unit includes data access ordering circuitry, for issuing data access requests indicated by elements of earlier and a later vector instructions, one being a write instruction. An element indicating the next data access for each of the instructions is determined. The next data accesses for the earlier and the later instructions may be reordered. The next data access of the earlier instruction is selected if the position of the earlier instruction's next data element is less than or equal to the position of the later instruction's next data element minus a predetermined value. The next data access of the later instruction may be selected if the position of the earlier instruction's next data element is higher than the position of the later instruction's next data element minus a predetermined value. Thus data accesses from earlier and later instructions are partially interleaved.

Подробнее
06-09-2018 дата публикации

COMPILING A PARALLEL LOOP WITH A COMPLEX ACCESS PATTERN FOR WRITING AN ARRAY FOR GPU AND CPU

Номер: US20180253289A1
Автор: Ishizaki Kazuaki
Принадлежит:

Computer-implemented methods are provided for compiling a parallel loop and generating Graphics Processing Unit (GPU) code, and Central Processing Unit (CPU) code for writing an array for the CPU and the CPU. A method includes compiling the parallel loop by (i) checking, based on a range of array elements to be written, whether the parallel loop can update all of the array elements and (ii) checking whether an access order of the array elements that the parallel loop reads or writes is known at compilation time. The method further includes determining an approach, from among a plurality of available approaches, to generate the CPU code and the GPU code based on (i) the range of the array elements to be written and (ii) the access order to the array elements in the parallel loop. 1. A computer-implemented method for compiling a parallel loop and generating Graphics Processing Unit (GPU) code and Central Processing Unit (CPU) code for writing an array for the GPU and the CPU , the method comprising:compiling the parallel loop by (i) checking, based on a range of array elements to be written, whether the parallel loop can update all of the array elements and (ii) checking whether an access order of the array elements that the parallel loop reads or writes is known at compilation time; anddetermining an approach, from among a plurality of available approaches, to generate the CPU code and the GPU code based on (i) the range of the array elements to be written and (ii) the access order to the array elements in the parallel loop.2. The computer-implemented method of claim 1 , wherein the GPU code and the CPU code are generated to be executable in parallel when regions of the array to be written are non-contiguous.3. The computer-implemented method of claim 1 , wherein (i) checking claim 1 , based on a range of array elements to be written claim 1 , whether the parallel loop can update all of the array elements comprises:checking whether the range of array elements to be ...

Подробнее
06-09-2018 дата публикации

COMPILING A PARALLEL LOOP WITH A COMPLEX ACCESS PATTERN FOR WRITING AN ARRAY FOR GPU AND CPU

Номер: US20180253290A1
Автор: Ishizaki Kazuaki
Принадлежит:

Computer-implemented methods are provided for compiling a parallel loop and generating Graphics Processing Unit (GPU) code and Central Processing Unit (CPU) code for writing an array for the GPU and the CPU. A method includes compiling the parallel loop by (i) checking, based on a range of array elements to be written, whether the parallel loop can update all of the array elements and (ii) checking whether an access order of the array elements that the parallel loop reads or writes is known at compilation time. The method further includes determining an approach, from among a plurality of available approaches, to generate the CPU code and the GPU code based on (i) the range of the array elements to be written and (ii) the access order to the array elements in the parallel loop. 1. A computer program product for compiling a parallel loop and generating Graphics Processing Unit (GPU) code and Central Processing Unit (CPU) code for writing an array for the GPU and the CPU , the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith , the program instructions executable by a computer to cause the computer to perform a method comprising:compiling the parallel loop by (i) checking, based on a range of array elements to be written, whether the parallel loop can update all of the array elements and (ii) checking whether an access order of the array elements that the parallel loop reads or writes is known at compilation time; anddetermining an approach, from among a plurality of available approaches, to generate the CPU code and the GPU code based on (i) the range of the array elements to be written and (ii) the access order to the array elements in the parallel loop.2. The computer program product of claim 1 , wherein the GPU code and the CPU code are generated to be executable in parallel when regions of the array to be written are non-contiguous.3. The computer program product of claim 1 , ...

Подробнее
14-09-2017 дата публикации

OPTIMIZATION OF LOOPS AND DATA FLOW SECTIONS IN MULTI-CORE PROCESSOR ENVIRONMENT

Номер: US20170262406A1
Автор: Vorbach Martin
Принадлежит: HYPERION CORE, INC.

The present invention relates to a method for compiling code for a multi-core processor, comprising: detecting and optimizing a loop, partitioning the loop into partitions executable and mappable on physical hardware with optimal instruction level parallelism, optimizing the loop iterations and/or loop counter for ideal mapping on hardware, chaining the loop partitions generating a list representing the execution sequence of the partitions. 1. A method for executing code on a processor , wherein the method comprises:detecting, by the processor, loop code in source code accessible to the processor, the loop code implementing one or more execution loops of the source code ;partitioning the loop code into a plurality of partitions by the processor;mapping, by the processor, each of the plurality of partitions, as a whole, onto a respective execution unit of an array of execution units for execution; andexecuting, by the processor, the plurality of partitions using the array of execution units.2. The method of wherein a group of the plurality of partitions comprises a sequence of multiple partitions claim 1 , wherein the multiple partitions are mapped and executed sequentially using the array of execution units.3. The method of wherein multiple partitions of the plurality of partitions are mapped and executed in parallel using the array of execution units.4. The method of wherein at least some of the plurality of partitions include a list of required resources according to which processor resources of the processor are allocated. This application is a continuation of U.S. patent application Ser. No. 14/693,793, filed Apr. 22, 2015, which is a continuation of U.S. patent application Ser. No. 13/519,887, filed Nov. 6, 2012 (now U.S. Pat. No. 9,043,769), which claims priority as a national phase application of International Patent Application No. PCT/EP2010/007950, filed Dec. 28, 2010, which claims priority to European Patent Application No. EP10007074.7, filed Jul. 9, ...

Подробнее
16-12-2021 дата публикации

Method and system for converting a single-threaded software program into an application-specific supercomputer

Номер: US20210389936A1
Принадлежит: Global Supercomputing Corporation

The invention comprises (i) a compilation method for automatically converting a single-threaded software program into an application-specific supercomputer, and (ii) the supercomputer system structure generated as a result of applying this method. The compilation method comprises: (a) Converting an arbitrary code fragment from the application into customized hardware whose execution is functionally equivalent to the software execution of the code fragment; and (b) Generating interfaces on the hardware and software parts of the application, which (i) Perform a software-to-hardware program state transfer at the entries of the code fragment; (ii) Perform a hardware-to-software program state transfer at the exits of the code fragment; and (iii) Maintain memory coherence between the software and hardware memories. If the resulting hardware design is large, it is divided into partitions such that each partition can fit into a single chip. Then, a single union chip is created which can realize any of the partitions. 1. A hardware component automatically compiled , by a compiler , from a single-threaded software program , where the instructions within the single-threaded software program include:a. at least one FIFO send instruction for sending a message in hardware over a group of pins of the hardware component, where the group of pins implements a sending FIFO interface; andb. at least one FIFO receive instruction for receiving a message in hardware over a group of pins of the hardware component, where the group of pins implements a receiving FIFO interface;where for the purpose of improving performance, the compiler automatically schedules and parallelizes the single-threaded software program including FIFO send and receive instruction, based on dependences imposed by register operands set or used by FIFO send and receive instructions and by register and memory operands set or used by remaining instructions; andwhere the hardware component is functionally equivalent to ...

Подробнее
12-09-2019 дата публикации

METHOD OF MEMORY ESTIMATION AND CONFIGURATION OPTIMIZATION FOR DISTRIBUTED DATA PROCESSING SYSTEM

Номер: US20190278573A1
Принадлежит:

The present invention relates to a method of memory estimation and configuration optimization for a distributed data processing system involves performing match between an application data stream and a data feature library, wherein the application data stream has received analysis and processing on conditional branches and/or loop bodies of an application code in a Java archive of the application, estimating a memory limit for at least one stage of the application based on the successful matching result, optimizing configuration parameters of the application accordingly, and acquiring static features and/or dynamic features of the application data based on running of the optimized application and performing persistent recording. Opposite to machine-learning-based memory estimation that does not ensure accuracy and fails to provide fine-grained estimation for individual stages, this method uses application analysis and existing data feature to estimate overall memory occupation more precisely and to estimate memory use of individual job stages for more fine-grained configuration optimization. 1. A method of memory estimation and configuration optimization for a distributed data processing system , wherein the method comprises the steps of:performing a match between an application data stream and a data feature library, wherein the application data stream has received analysis and processing on conditional branches and loop bodies of an application code in a Java archive of the application;estimating a memory limit for at least one stage of the application based on a successful result of the match:optimizing configuration parameters of the application based on the estimated memory limit;acquiring static features and dynamic features of the application data based on running of the optimized application and performing persistent recording;estimating a memory limit of the at least one stage of the application again based on a feedback result of the static features and ...

Подробнее
12-09-2019 дата публикации

VECTORIZE STORE INSTRUCTIONS METHOD AND APPARATUS

Номер: US20190278577A1
Принадлежит:

Methods, apparatus, and system to optimize compilation of source code into vectorized compiled code, notwithstanding the presence of output dependencies which might otherwise preclude vectorization. 1. An apparatus for computing , comprising:a computer processor and a memory;a compilation optimization module to optimize compilation of the source code, wherein to optimize compilation of the source code, the compilation optimization module is to determine that a loop or function in the source code comprises mutually dependent store instructions; anda vectorization module to vectorize a set of mutually dependent store instructions in the loop, wherein to vectorize the set of mutually dependent store instructions, the vectorization module is to determine a scalar store order for the set of mutually dependent store instructions, determine a vectorized store order for the scalar store order and at least one scatter instruction to store a result of the vectorized store order to a set of non-contiguous or random locations in a target memory.2. The apparatus according to claim 1 , wherein determine the vectorized store order for the scalar store order comprises determine the vectorized store order for the scalar store order based on a number of vector elements in a vector register coupled to a target computer processor and exclude a no-operation store instruction from the vectorized store order.3. The apparatus according to claim 2 , wherein a scalar matrix comprising a number of sequential scalar instruction iterations and a number of sequential store instructions in each iteration in a number of sequential scalar instruction iterations has a different size than a vector matrix comprising the number of elements executed by a SIMD instruction using the vector register.4. The apparatus according to claim 1 , wherein determine the vectorized store order for the scalar store order further comprises transpose each store instruction in the set of mutually dependent store ...
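A small C++ stand-in for the scatter step, assuming the batch of mutually dependent stores gathered from several scalar iterations is applied in the original scalar store order so aliasing stores keep their last-writer-wins behaviour; names and the usage values are illustrative.

#include <cstddef>
#include <vector>

// Software stand-in for a scatter: values[k] is stored to target[index[k]] in
// ascending lane order, so if two lanes alias the same location the store that
// came later in the original scalar order survives.
void ordered_scatter(std::vector<double>& target,
                     const std::vector<std::size_t>& index,
                     const std::vector<double>& values) {
    for (std::size_t k = 0; k < index.size(); ++k)
        target[index[k]] = values[k];
}

// Usage: indices {2, 5, 2} alias location 2; after the scatter, target[2]
// holds 30.0, matching what the scalar store sequence would have produced.
// std::vector<double> t(8, 0.0);
// ordered_scatter(t, {2, 5, 2}, {10.0, 20.0, 30.0});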

Подробнее
03-09-2020 дата публикации

Method and system for converting a single-threaded software program into an application-specific supercomputer

Номер: US20200278848A1
Принадлежит: Global Supercomputing Corp

The invention comprises (i) a compilation method for automatically converting a single-threaded software program into an application-specific supercomputer, and (ii) the supercomputer system structure generated as a result of applying this method. The compilation method comprises: (a) Converting an arbitrary code fragment from the application into customized hardware whose execution is functionally equivalent to the software execution of the code fragment; and (b) Generating interfaces on the hardware and software parts of the application, which (i) Perform a software-to-hardware program state transfer at the entries of the code fragment; (ii) Perform a hardware-to-software program state transfer at the exits of the code fragment; and (iii) Maintain memory coherence between the software and hardware memories. If the resulting hardware design is large, it is divided into partitions such that each partition can fit into a single chip. Then, a single union chip is created which can realize any of the partitions.

Подробнее
18-10-2018 дата публикации

Partial connection of iterations during loop unrolling

Номер: US20180300113A1
Принадлежит: International Business Machines Corp

A method and system for partial connection of iterations during loop unrolling during compilation of a program by a compiler. Unrolled loop iterations of a loop in the program are selectively connected during loop unrolling during the compilation, including redirecting, to the head of the loop, undesirable edges of a control flow from one iteration to a next iteration of the loop. Merges on a path of hot code are removed to increase a scope for optimization of the program. The head of the loop and a start of a replicated loop body of the loop are equivalent points of the control flow.

Подробнее
09-11-2017 дата публикации

METHODS AND SYSTEMS TO VECTORIZE SCALAR COMPUTER PROGRAM LOOPS HAVING LOOP-CARRIED DEPENDENCES

Номер: US20170322786A1
Принадлежит:

Methods and systems to convert a scalar computer program loop having loop-carried dependences into a vector computer program loop are disclosed. One such method includes, at runtime, identifying, by executing an instruction with one or more processors, a first loop iteration that cannot be executed in parallel with a second loop iteration due to a set of conflicting scalar loop operations. The first loop iteration is executed after the second loop iteration. The method also includes sectioning, by executing an instruction with one or more processors, a vector loop into vector partitions including a first vector partition. The first vector partition executes consecutive loop iterations in parallel and the consecutive loop iterations start at the second loop iteration and end before the first loop iteration. 1. A method to convert a scalar computer program loop having loop-carried dependences to a vector computer program loop , the method comprising:at runtime, identifying, by executing an instruction with one or more processors, a first loop iteration that cannot be executed in parallel with a second loop iteration due to a set of conflicting scalar loop operations, the first loop iteration being executed after the second loop iteration; andsectioning, by executing an instruction with one or more processors, a vector loop into vector partitions including a first vector partition, the first vector partition to execute consecutive loop iterations in parallel, the consecutive loop iterations to start at the second loop iteration and to end before the first loop iteration.2. The method defined in claim 1 , wherein the consecutive loop iterations are a first set of consecutive loop iterations claim 1 , and the vector partitions include a second vector partition to execute a second set of consecutive loop iterations in parallel claim 1 , the second set of consecutive loop iterations to start at the first loop iteration and to end before a third loop iteration.3. The method ...

Подробнее
01-10-2020 дата публикации

Vehicle entertainment system interactive user interface co-development environment

Номер: US20200310764A1
Автор: Fouzi Djaafri
Принадлежит: Panasonic Avionics Corp

A co-development platform for a graphical user interface to a system interaction application of a terminal device for a vehicle entertainment system. A terminal hardware emulator has an emulated data processor executing software instructions of the system interaction application. A display terminal emulator generates an output corresponding to the graphical user interface, and is selectively targeted to device parameters specific to the terminal device. The display terminal emulator is also receptive to test inputs to one or more input-receptive graphic elements in the graphical user interface. A graphical user interface editor includes one or more interface element settings modifiable by a test user. A test user access controller defines access privilege levels for the test users, which are selectively restricted and permitted to access one or more functionalities.

Подробнее
08-12-2016 дата публикации

Parallel computing apparatus and parallel processing method

Номер: US20160357529A1
Автор: Yuji Tsujimori
Принадлежит: Fujitsu Ltd

Code includes a loop including update processing for updating elements of an array, indicated by a first index, and reference processing for referencing elements of the array, indicated by a second index. At least one of the first index and the second index depends on a parameter whose value is determined at runtime. A processor calculates, based on the value of the parameter determined at runtime, a first range of the elements to be updated by the update processing and a second range of the elements to be referenced by the reference processing prior to the execution of the loop. Then, the processor compares the first range with the second range and outputs a warning indicating that the loop is not parallelizable when the first range and the second range overlap in part.
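A minimal C++ sketch of such a pre-loop range check, assuming an access pattern of the form a[i + k] = ... a[i] ... for 0 <= i < n, where k is the parameter whose value is known only at run time; names and numbers are illustrative.

#include <algorithm>
#include <cstdio>

// The updated range is [k, n + k) and the referenced range is [0, n); if the
// two ranges intersect, the loop is reported as not parallelizable.
bool ranges_overlap(long ub, long ue, long rb, long re) {
    return std::max(ub, rb) < std::min(ue, re);
}

int main() {
    long n = 1000;
    long k = 3;                             // value determined at runtime
    long upd_begin = k, upd_end = n + k;    // elements written by a[i + k]
    long ref_begin = 0, ref_end = n;        // elements read by a[i]
    if (ranges_overlap(upd_begin, upd_end, ref_begin, ref_end))
        std::puts("warning: update and reference ranges overlap; loop is not parallelizable");
}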

Подробнее
20-12-2018 дата публикации

Alternative loop limits

Номер: US20180365561A1
Принадлежит: Google LLC

Methods, systems, and apparatus for accessing a N-dimensional tensor are described. In some implementations, a method includes, for each of one or more first iterations of a first nested loop, performing iterations of a second nested loop that is nested within the first nested loop until a first loop bound for the second nested loop is reached. A number of iterations of the second nested loop for the one or more first iterations of the first nested loop is limited by the first loop bound in response to the second nested loop having a total number of iterations that exceeds a value of a hardware property of the computing system. After a penultimate iteration of the first nested loop has completed, one or more iterations of the second nested loop are performed for a final iteration of the first nested loop until an alternative loop bound is reached.

Подробнее
27-12-2018 дата публикации

LOOP EXECUTION WITH PREDICATE COMPUTING FOR DATAFLOW MACHINES

Номер: US20180373509A1
Принадлежит:

Compilers for compiling computer programs and apparatuses including compilers are disclosed herein. A compiler may include one or more analyzers to parse and analyze source instructions of a computer program including identification of nested loops of the computer program. The compiler may also include a code generator coupled to the one or more analyzers to generate and output executable code for the computer program that executes on a data flow machine, including a data flow graph, based at least in part on results of the analysis. In embodiments, the executable code may include executable code that recursively computes predicates of identified nested loops for use to generate control signal for the data flow graph to allow execution of each loop to start when the loop's predicate is available, independent of whether any other loop is in execution or not. Other embodiments may be disclosed or claimed. 1. A compiler for compiling a computer program , comprising:one or more analyzers to parse and analyze source instructions of a computer program including identification of nested loops of the computer program; anda code generator coupled to the one or more analyzers to generate and output executable code for the computer program that executes on a data flow machine, including a data flow graph, based at least in part on results of the analysis, wherein the executable code includes executable code that recursively computes predicates of identified nested loops for use to generate control signal for the data flow graph to allow execution of each loop to start when the loop's predicate is available, independent of whether any other loop is in execution or not.2. The compiler of claim 1 , wherein the computed predicate of an identified loop is a function of the loop's initial predicate (Pinitial) and the predicate of its backedge (Pbackedge).3. The compiler of claim 2 , wherein the predicate of the identified loop is computed as: Pbackedge∥Pinitial claim 2 , where ∥ is ...
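A tiny C++ sketch of the predicate rule spelled out in the claims (Pbackedge || Pinitial); the way the inner loop's initial predicate is derived from the enclosing loop's predicate is an assumption added only for illustration.

// Per the claims, the predicate that allows a loop to start is a function of
// its initial predicate and the predicate of its backedge.
bool loop_predicate(bool p_initial, bool p_backedge) {
    return p_backedge || p_initial;
}

// Illustrative assumption: for a loop nested inside another, its initial
// predicate combines the enclosing loop's predicate with the inner entry
// condition, so predicates of nested loops can be computed recursively.
bool inner_loop_predicate(bool outer_predicate, bool inner_entry_condition,
                          bool inner_backedge) {
    bool p_initial = outer_predicate && inner_entry_condition;
    return loop_predicate(p_initial, inner_backedge);
}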

Подробнее
26-11-2020 дата публикации

LOOP NEST REVERSAL

Номер: US20200371763A1
Принадлежит:

Systems, apparatuses and methods may provide for technology to identify in user code, a nested loop which would result in cache memory misses when executed. The technology further reverses an order of iterations of a first inner loop in the nested loop to obtain a modified nested loop. Reversing the order of iterations increases a number of times that cache memory hits occur when the modified nested loop is executed. 125-. (canceled)26. A code utility system , comprising:a design interface to receive user code;a cache memory; anda compiler coupled to the design interface and the cache memory, the compiler to:identify in the user code, a nested loop which would result in cache memory misses of the cache memory when executed, andreverse an order of iterations of a first inner loop in the nested loop to obtain a modified nested loop, wherein reverse the order of iterations increases a number of times that cache memory hits occur when the modified nested loop is executed.27. The system of claim 26 , wherein the compiler is to unroll an outer loop of the nested loop by a factor claim 26 ,wherein the unroll is to modify an inner loop of the nested loop based upon the factor into the first inner loop and a second inner loop.28. The system of claim 26 , wherein to identify the nested loop claim 26 , the compiler is to identify an inner loop of the nested loop as accessing data which is invariant across iterations of an outer loop of the nested loop.29. The system of claim 26 , wherein to identify the nested loop claim 26 , the compiler is to identify an inner loop of the nested loop as accessing a total size of data which exceeds an available memory of the cache memory.30. The system of claim 26 , wherein to identify the nested loop claim 26 , the compiler is to identify that an inner loop of the nested loop does not have loop-carried dependencies.31. The system of claim 26 , wherein the compiler is to modify the user code by executing an unroll and jam on the nested loop. ...
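A hedged C++ sketch of the transformation, assuming an unroll-and-jam factor of 2 and an even number of rows; which of the two inner-loop copies is reversed, and all names, are illustrative.

#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Before: every row walks b[] front to back, so when one pass ends the front
// of b[] may already have been evicted from cache before the next row starts.
void scale_rows(Matrix& c, const Matrix& a, const std::vector<double>& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            c[i][j] = a[i][j] * b[j];
}

// After (sketch): the outer loop is unrolled and jammed by a factor of 2 and
// the second copy of the inner loop is reversed, so it starts from the tail of
// b[], which the preceding inner loop has just touched and which is therefore
// still likely to be in cache. Assumes an even number of rows for brevity.
void scale_rows_reversed(Matrix& c, const Matrix& a, const std::vector<double>& b) {
    for (std::size_t i = 0; i + 1 < a.size(); i += 2) {
        for (std::size_t j = 0; j < b.size(); ++j)
            c[i][j] = a[i][j] * b[j];
        for (std::size_t j = b.size(); j-- > 0; )       // reversed iteration order
            c[i + 1][j] = a[i + 1][j] * b[j];
    }
}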

Подробнее
19-12-2019 дата публикации

INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD, AND COMPUTER READABLE MEDIUM

Номер: US20190384687A1
Принадлежит: Mitsubishi Electric Corporation

A processing dividing unit extracts, from a function model including one or more loop processes, each of the one or more loop processes. A parameter extracting unit determines the characteristics of each extracted loop process. A performance calculation basic formula selecting unit selects, for each loop process, from a plurality of processing time calculation procedures for calculating a processing time, a processing time calculation procedure for calculating a processing time of each loop process, based on the characteristics of each loop process and the architecture of computational resources executing the function model. A performance estimating unit calculates a processing time of each loop process by using a corresponding processing time calculation procedure selected by the performance calculation basic formula selecting unit. 1. An information processing device comprising: processing circuitry to: extract, from a program including one or more loop processes, each of the one or more loop processes; determine characteristics of each loop process extracted; select, for each loop process, from a plurality of processing time calculation procedures for calculating a processing time, a processing time calculation procedure for calculating a processing time of each loop process, based on the characteristics of each loop process determined and architecture of computational resources executing the program; and calculate a processing time of each loop process by using a corresponding processing time calculation procedure selected. 2. The information processing device according to claim 1, wherein the processing circuitry selects, for each loop process, from a plurality of memory access delay time calculation procedures for calculating a memory access delay time, a memory access delay time calculation procedure for calculating a memory access delay time in each loop process, based on the architecture of computational resources executing the program, ...

Подробнее
03-11-2022 дата публикации

METHOD FOR CONTROLLING THE FLOW EXECUTION OF A GENERATED SCRIPT OF A BLOCKCHAIN TRANSACTION

Номер: US20220350579A1
Принадлежит:

A method and system for generating a transaction for a blockchain protocol are disclosed. The method comprises using a software resource to receive, generate, or derive at least one data item, insert, at least once, a portion of code into a script associated with the transaction, where the script is written in a language that is functionally restricted. Upon execution of the script, the portion of code provides functionality of a control flow mechanism controlled or influenced by the at least one data item. The method further comprises using the software resource to generate the blockchain transaction comprising the script and submit the blockchain transaction to a blockchain network. 1. A computer-implemented method , comprising the steps of:receiving, generating or deriving at least one data item received from an off-chain source and generated by a user of a random or pseudo-random number generator; andinserting, at least once, at least one portion of code into a script associated with a blockchain transaction, wherein the script is written in a language that is functionally restricted such that the script, upon execution, provides the functionality of a control flow mechanism, the behaviour of the control flow mechanism being controlled or influenced by the at least one data item;generating the blockchain transaction comprising the script; andsubmitting the blockchain transaction to a blockchain network.2. The computer-implemented method according to wherein:the script is associated with an input or output of the blockchain transaction.3. The computer-implemented method according to wherein:the blockchain transaction is generated in accordance with, or for use with, a blockchain protocol.4. The computer-implemented method according to wherein the protocol is the Bitcoin protocol or a variant of the Bitcoin protocol.56-. (canceled)7. The computer-implemented method according to wherein:the control flow mechanism is a loop, such as a for loop, or a selection ...

Подробнее
10-11-2022 дата публикации

Computer Implemented Program Specialization

Номер: US20220357933A1
Принадлежит:

A computerized technique for program simplification and specialization combines a partial interpretation of the program based on a subset of program functions to obtain variable states with concrete values at a program “neck.” These concrete values are then propagated as part of an optimization transformation that simplifies the program based on these constant values, for example, by eliminating branches that are never taken based on the constant values. 1. An apparatus for producing compact program versions comprising:at least one computer processor; anda memory coupled to the at least one processor holding a stored program executable by the at least one computer processor to:(a) receive a program implementing multiple functions and a description of a desired subset of functions less than the set of the multiple functions;(b) identify a neck of the program dividing configuration instructions from main logic instructions;(c) partially interpret the program to the neck to establish concrete values of variables at the neck;(d) propagate the concrete values through the main logic instructions; and(e) simplify the program by removing instructions of the main logic instructions that will never execute based on the propagated concrete values.2. The apparatus of wherein (c) uses symbolic execution up to the neck to establish concrete representations of the variable states claim 1 , and (d) uses the concrete representations and the desired subset of functions to perform the constant conversion.3. The apparatus of wherein (e) performs optimizing transformations using the concrete values.4. The apparatus of wherein the optimizing transformations employ at least one of loop unrolling and function in-lining.5. The apparatus of wherein the removed instructions include instruction branches conditioned on expressions which will never be executed based on the propagated concrete values.6. The apparatus of wherein the program is parameterized by command-line switch inputs claim 1 , ...
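A before/after C++ illustration of the idea, assuming a toy program whose only configuration input is a verbosity switch; the placement of the "neck" and all names are illustrative, not taken from the entry.

#include <cstdio>

// Before specialization: behaviour depends on a command-line switch.
int main_generic(int argc, char** argv) {
    bool verbose = (argc > 1 && argv[1][0] == 'v');   // configuration logic, before the neck
    // ---- neck: configuration is now fixed, main logic follows ----
    if (verbose) std::puts("starting in verbose mode");
    std::puts("doing the actual work");
    return 0;
}

// After specialization for the desired subset "never verbose": partial
// interpretation up to the neck established verbose == false, the constant was
// propagated through the main logic, and the branch that can never execute was
// removed.
int main_specialized() {
    std::puts("doing the actual work");
    return 0;
}

int main() { return main_specialized(); }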

Подробнее
31-01-2017 дата публикации

Apparatus and method for dynamically determining the execution mode of a reconfigurable array

Номер: KR101700406B1
Принадлежит: 삼성전자주식회사

An apparatus and method for dynamically determining the execution mode of a reconfigurable array are provided. According to one aspect of the present invention, performance information about a loop is obtained before or during execution of the loop. The performance information indicates whether it is more advantageous to run the loop in VLIW mode or in CGA mode. In the performance information, the variable is the number of loop iterations of the loop. If the number of loop iterations of the loop is known, the more advantageous mode is selected based on the performance information. If the number of loop iterations is not known, the mode is selected using a predicted value of the loop's execution time. The predicted value can be obtained by measuring the execution time of the loop and cumulatively combining the measured value with the previous predicted value using weights.
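A small C++ sketch of the prediction-based mode choice for the case where the trip count is unknown; the weight, the threshold and the struct name are assumptions, not taken from the entry.

// Mixes the latest measured execution time with the previous prediction using
// a weight, and prefers CGA mode only when the predicted running time is long
// enough to amortize reconfiguration (illustrative values).
struct LoopModePredictor {
    double predicted = 0.0;
    bool   seeded = false;

    void record(double measured, double alpha = 0.5) {
        predicted = seeded ? alpha * measured + (1.0 - alpha) * predicted
                           : measured;
        seeded = true;
    }

    bool prefer_cga(double cga_threshold = 10000.0 /* cycles, illustrative */) const {
        return seeded && predicted > cga_threshold;
    }
};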

Подробнее
30-08-2011 дата публикации

Parallelizing sequential frameworks using transactions

Номер: US8010550B2
Принадлежит: Microsoft Corp

Various technologies and techniques are disclosed for transforming a sequential loop into a parallel loop for use with a transactional memory system. A transactional memory system is provided. A first section of code containing an original sequential loop is transformed into a second section of code containing a parallel loop that uses transactions to preserve an original input to output mapping. For example, the original sequential loop can be transformed into a parallel loop by taking each iteration of the original sequential loop and generating a separate transaction that follows a pre-determined commit order process. At least some of the separate transactions are executed in different threads. When an unhandled exception is detected that occurs in a particular transaction while the parallel loop is executing, state modifications made by the particular transaction and predecessor transactions are committed, and state modifications made by successor transactions are discarded.
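A minimal C++ sketch of the commit-order idea only: iterations run concurrently but publish their state modifications in the original order, preserving the input to output mapping. Conflict detection, re-execution and the unhandled-exception policy of a real transactional memory system are omitted, and the thread-per-iteration structure is purely illustrative.

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

void ordered_parallel_loop(std::vector<int>& data) {
    std::atomic<std::size_t> next_to_commit{0};
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < data.size(); ++i) {
        workers.emplace_back([&, i] {
            int result = data[i] * data[i] + 1;            // speculative work
            while (next_to_commit.load() != i) std::this_thread::yield();
            data[i] = result;                              // commit in original order
            next_to_commit.store(i + 1);
        });
    }
    for (auto& t : workers) t.join();
}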

Подробнее
16-09-2003 дата публикации

Parallel program generating method

Номер: US6622301B1
Принадлежит: HITACHI LTD

When converting a sequential execution source program into a parallel program to be executed by respective processors (nodes) of a distributed shared memory parallel computer, a compiler computer transforms the source program to increase a processing speed of the parallel program. First, a kernel loop having a longest sequential execution time is detected in the source program. Next, a data access pattern equal to that of the kernel loop is reproduced to generate a control code to control first touch data distribution. The first touch control code generated is inserted in the parallel program.

Подробнее
28-03-2019 дата публикации

Loop nest reversal

Номер: WO2019059927A1
Принадлежит: Intel Corporation

Systems, apparatuses and methods may provide for technology to identify in user code, a nested loop which would result in cache memory misses when executed. The technology further reverses an order of iterations of a first inner loop in the nested loop to obtain a modified nested loop. Reversing the order of iterations increases a number of times that cache memory hits occur when the modified nested loop is executed.

Подробнее
01-08-2018 дата публикации

Staged loop instructions

Номер: EP2680132B1
Принадлежит: Analog Devices Inc

Подробнее
04-05-1999 дата публикации

Method and system for optimizing code

Номер: US5901318A
Автор: Wei Chung Hsu
Принадлежит: Hewlett Packard Co

An optimizing compiler for optimizing code in a computer system having a CPU and a memory. The code has a loop wherein the loop includes statements conditionally executed depending on the evaluation of a control flow statement. The inventive compiler separates the code into a index collection phase and an execution phase. The index collection phase collects array indices indicating whether the control flow statement evaluates true for each particular loop iteration. The execution phase builds self loops without conditional statements. The self loops use the array indices to execute only the loop instructions that should be executed. Since those instruction are predetermined by the index collection phase, performance enhancement features of the CPU, such as branch prediction, pipelining, and a superscalar architecture can be fully exploited.

Подробнее
07-08-2001 дата публикации

Method and apparatus for finding loop-level parallelism in a pointer based application

Номер: US6272676B1
Принадлежит: Intel Corp

A method and apparatus for finding loop_level parallelism in a sequence of instructions. In one embodiment, the method includes the steps of determining if a variable which identifies a memory address of a data structure is an induction variable; and determining if execution of the sequence of instructions terminates in response to a comparison of the variable to an invariant value. If the two conditions of the present invention are found to be true, the respective sequence of instructions is a candidate to be flagged for multi-threading execution, assuming the loop of the instructions terminates.
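An illustrative example in C of the loop shape such an analysis looks for (the type and function names are assumptions): the pointer is advanced by a constant stride each iteration, i.e. it is an induction variable, and the loop terminates when the pointer is compared against a loop-invariant value, making the loop a candidate for multi-threaded execution.

```c
typedef struct record { double value; } record_t;

double sum_records(const record_t *base, const record_t *end /* loop-invariant bound */)
{
    double s = 0.0;
    for (const record_t *p = base; p != end; p++)   /* p: pointer induction variable */
        s += p->value;
    return s;
}
```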

Подробнее
23-02-2016 дата публикации

Methods and systems to vectorize scalar computer program loops having loop-carried dependences

Номер: US9268541B2
Принадлежит: Intel Corp

Methods and systems to convert scalar computer program loops having loop carried dependences to vector computer program loops are disclosed. One example method and system generates a first predicate set associated with a first conditionally executed statement. The first predicate set contains a first set of predicates that cause a variable to be defined in a scalar computer program loop at or before the variable is defined by the first conditionally executed statement. The method and system also generates a second predicate set associated with the first conditionally executed statement. The second predicate set contains a second set of predicates that cause the variable to be used in the scalar computer program loop at or before the variable is defined by the first conditionally executed statement. The method and system determines whether the second predicate set is a subset of the first predicate set and, based on the determination, propagates a vector value in an element of a vector of the variable to a subsequent element of the vector.
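One step of the scheme, emulated in scalar C with names that are not the patent's: when an element of the vector is used before the variable is (re)defined under the loop's predicates, the value from the most recent defining element, or the carry-in from the previous strip, is propagated forward into it.

```c
/* v[j]       : elements of the variable's vector for one strip of vl iterations
   defined[j] : 1 if this element's predicate defines the variable in its iteration
   carry      : value of the variable carried in from the previous strip          */
void propagate_forward(double v[], const int defined[], int vl, double *carry)
{
    for (int j = 0; j < vl; j++) {
        if (defined[j]) *carry = v[j];   /* this element defines a new value            */
        else            v[j]   = *carry; /* use-before-define: take the propagated value */
    }
}
```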

Подробнее
19-09-2002 дата публикации

Hardware supported software pipelined loop prologue optimization

Номер: US20020133813A1
Автор: Alexander Ostanevich
Принадлежит: Elbrus International

A method for optimizing a software pipelineable loop in a software code is provided. The loop comprises one or more pipelined stages and one or more loop operations. The method comprises evaluating an initiation interval time (IN) for a pipelined stage of the loop. A loop operation time latency (Tld) and a number of loop operations (Np) from the pipelined stages to peel based on IN and Tld is then determined. The loop operation is peeled Np times and copied before the loop in the software code. A vector of registers is allocated and the results of the peeled loop operations and a result of an original loop operation is assigned to the vector of registers. Memory addresses for the results of the peeled loop operations and original loop operation are also assigned.
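A scalar C sketch of the peeling idea under assumed parameters (NP standing in for Np, and a simple copy loop as the body): the first NP loads are peeled and copied before the loop into a small vector of registers, and inside the loop each use consumes a load that was issued NP iterations earlier.

```c
#define NP 4                        /* assumed: roughly load latency / initiation interval */

void scaled_copy(const double *a, double *b, int n)
{
    double r[NP];                   /* "vector of registers" holding the peeled loads */
    int pre = n < NP ? n : NP;

    for (int k = 0; k < pre; k++)   /* peeled (prologue) loads, copied before the loop */
        r[k] = a[k];

    for (int i = 0; i < n; i++) {
        double x = r[i % NP];       /* value loaded NP iterations ago */
        if (i + NP < n)
            r[i % NP] = a[i + NP];  /* issue the load destined for iteration i + NP */
        b[i] = x * 2.0;             /* use of the loaded value */
    }
}
```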

Подробнее
16-07-2014 дата публикации

Parallelizing sequential frameworks using transactions

Номер: CN101681272B
Принадлежит: Microsoft Corp

Various techniques and methods are disclosed for transforming a sequential loop into a parallel loop for use with a transactional memory system. A transactional memory system is provided. A first section of code containing an original sequential loop is transformed into a second section of code containing a parallel loop that uses transactions to preserve the original input-to-output mapping. For example, the original sequential loop can be transformed into a parallel loop by taking each iteration of the original sequential loop and generating a separate transaction that follows a predetermined commit-order process. At least some of the separate transactions are executed in different threads. When an unhandled exception is detected in a particular transaction while the parallel loop is executing, the state modifications made by that transaction and its predecessor transactions are committed, and the state modifications made by successor transactions are discarded.

Подробнее
11-12-2016 дата публикации

Patent TWI562065B

Номер: TWI562065B
Автор: Mitsuru Mushano
Принадлежит: Mush A Co Ltd

Подробнее
18-05-2018 дата публикации

Method and apparatus for implementing code versioning for transactional memory region promotion

Номер: CN104572260B
Принадлежит: Globalfoundries Inc

The present invention relates to a method and apparatus for implementing code versioning for transactional memory region promotion. An illustrative embodiment of a computer-implemented process for code versioning for transactional memory region promotion receives a portion of candidate source code and outlines the received portion of candidate source code for parallel execution. The computer-implemented process also wraps a critical region with enter and exit routines so as to enter a speculative sub-process, where the enter and exit routines also gather conflict statistics at run time. The outlined code portion is executed to determine, according to the conflict statistics gathered at run time, which particular loop version of a plurality of loop versions to use.
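A minimal sketch, in C, of how run-time conflict statistics gathered by enter/exit wrappers might drive the choice of loop version; the counter names, warm-up length, and conflict threshold are assumptions, not values from the patent.

```c
static long spec_attempts = 0;      /* incremented by the enter routine */
static long spec_conflicts = 0;     /* incremented when speculation aborts */

void enter_speculation(void) { spec_attempts++; }
void record_conflict(void)   { spec_conflicts++; }

/* pick the loop version from the gathered statistics */
int use_speculative_version(void)
{
    if (spec_attempts < 100)                    /* warm-up: keep speculating */
        return 1;
    return spec_conflicts * 4 < spec_attempts;  /* fall back once conflicts exceed ~25% */
}
```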

Подробнее
10-04-2018 дата публикации

Reuse of decoded instructions

Номер: US9940136B2
Принадлежит: Microsoft Technology Licensing LLC

Systems and methods are disclosed for reusing fetched and decoded instructions in block-based processor architectures. In one example of the disclosed technology, a system includes a plurality of block-based processor cores and an instruction scheduler. A respective core is capable of executing one or more instruction blocks of a program. The instruction scheduler can be configured to identify a given instruction block of the program that is resident on a first processor core of the processor cores and is to be executed again. The instruction scheduler can be configured to adjust a mapping of instruction blocks in flight so that the given instruction block is re-executed on the first processor core without re-fetching the given instruction block.

Подробнее
20-08-2021 дата публикации

Hardware acceleration method, compiler and equipment

Номер: CN110874212B
Принадлежит: Huawei Technologies Co Ltd

Embodiments of the present invention disclose a hardware acceleration method, a compiler, and a device, used to improve code execution efficiency and thereby achieve hardware acceleration. A method in an embodiment of the invention includes: the compiler obtains compilation policy information and source code, where the compilation policy information indicates that a first code type matches a first processor and a second code type matches a second processor; the compiler analyzes code segments in the source code according to the compilation policy information and identifies a first code segment belonging to the first code type or a second code segment belonging to the second code type; the compiler compiles the first code segment into first executable code and sends the first executable code to the first processor, and compiles the second code segment into second executable code and sends the second executable code to the second processor.

Подробнее
13-07-2017 дата публикации

Program conversion device, program conversion method, and recording medium having program for program conversion recorded therein

Номер: WO2017119378A1
Автор: 孝道 宮本
Принадлежит: 日本電気株式会社

Provided is a program conversion device and the like capable of converting a program that contains a loop with a termination check into a program that can be executed at high speed by an information processing device equipped with a vector operation unit. A program conversion device 501 comprises: a program conversion unit 502 that, when a first loop process executing first processes in a certain order accesses storage regions that are not contiguous with respect to that order, and a second loop process repeating the first loop process a plurality of times includes a determination process that ends the second loop process depending on whether a condition is satisfied, converts the program into a third loop process in which the repetition of the second loop process is repeated a first number of times and a fourth loop process that is repeated a second number of times within the third loop process; a loop division unit 503 that converts the process in which the first loop process is repeated the second number of times and the process in which the determination process is repeated the second number of times into processes repeated the first number of times; a variable rearrange unit 505 that converts the first process and the determination process into processes that access storage regions which differ for each fourth loop process and are contiguous in the processing order of the fourth loop processes; and a process exchange unit 504 that exchanges the processing order of the fourth loop process and the first loop process.

Подробнее
11-08-1998 дата публикации

Architectural support for execution control of prologue and epilogue periods of loops in a VLIW processor

Номер: US5794029A
Принадлежит: Elbrus International Ltd

For certain classes of software pipelined loops, prologue and epilogue control is provided by loop control structures, rather than by predicated execution features of a VLIW architecture. For loops compatible with two simple constraints, code elements are not required for disabling garbage operations during prologue and epilogue loop periods. As a result, resources associated with implementation of the powerful architectural feature of predicated execution need not be squandered to service loop control. In particular, neither increased instruction width nor an increased number of instructions in the loop body is necessary to provide loop control in accordance with the present invention. Fewer service functions are required in the body of a loop. As a result, loop body code can be more efficiently scheduled by a compiler and, in some cases, fewer instructions will be required, resulting in improved loop performance. Loop control logic includes a loop control registers having an epilogue counter field, a shift register, a side-effects enabled flag, a current loop counter field, a loop mode flag, and side-effects manual control and loads manual control flags. Side-effects enabling logic and load enabling logic respectively issue a side-effects enabled predicate and a loads enabled predicate to respective subsets of execution units. Software pipelined simple and inner loops are supported.

Подробнее
01-01-2014 дата публикации

Staged loop instructions

Номер: EP2680132A2
Принадлежит: Analog Devices Inc

Loop instructions are analyzed and assigned stage numbers based on dependencies between them and machine resources available. The loop instructions are selectively executed based on their stage numbers, thereby eliminating the need for explicit loop set-up and tear-down instructions. On a Single Instruction, Multiple Data machine, the final instance of each instruction may be executed on a subset of the processing elements or vector elements, dependent on the number of iterations of the original loop.

Подробнее
13-03-2020 дата публикации

Information processing apparatus, compile program, compile method, and cache control method

Номер: JP6665720B2
Автор: 優太 向井
Принадлежит: Fujitsu Ltd

Подробнее
05-09-2007 дата публикации

Compiling method, compiling apparatus and computer system for a loop in a program

Номер: EP1828889A1
Автор: Fan Wu, Yanmeng Sun
Принадлежит: KONINKLIJKE PHILIPS ELECTRONICS NV

A method for compiling a program including a loop is provided. In the program, the loop includes K instructions (K>2) and repeats M times (M>2). The compiling method comprises the following steps: performing resource conflict analysis on the K instructions in the loop; dividing the K instructions in the loop into a first combined instruction section, a connection instruction section and a second combined instruction section, such that there is no resource conflict between the instructions in the first combined instruction section and the instructions in the second combined instruction section; and compiling the program, wherein the instructions in the first combined instruction section of cycle N (N=2, 3, ...M) and the instructions in the second combined instruction section of cycle N-1 are combined and compiled together. A compiling apparatus and a computer system for realizing the above-mentioned compiling method are further provided.
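A schematic C rendering of the overlap (the three sections are stand-in functions; in the compiled code they are instruction groups packed into shared issue slots rather than calls): the first combined section of cycle N is scheduled together with the second combined section of cycle N-1, which is possible because those two sections have no resource conflicts.

```c
void first_section(int n);
void connection_section(int n);
void second_section(int n);

void pipelined_schedule(int M)
{
    first_section(1);
    connection_section(1);
    for (int n = 2; n <= M; n++) {
        /* no resource conflicts between these two sections, so the compiler can
           pack them into the same issue slots; shown here as adjacent calls */
        second_section(n - 1);
        first_section(n);
        connection_section(n);
    }
    second_section(M);        /* drain the last iteration */
}
```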

Подробнее
11-05-2015 дата публикации

Data processing apparatus, data processing system, data structure, recording medium, storage device, and data processing method

Номер: JPWO2013118754A1
Принадлежит: 株式会社Mush−A

The present invention aims to eliminate bottlenecks in loop processing and to perform parallel processing at high speed. Each of a plurality of processing units has: an input/output unit that acquires only packets whose destination information, calculated based on at least part of extended identification information, indicates that processing unit; an operation unit that executes the processing instruction to be executed first among the processing instructions of a packet acquired by the input/output unit, generates a packet in which extended identification information designating the next instruction as the new first instruction is attached to the data produced by that execution, and inputs the generated packet to the input/output unit; a template storage unit in which template information for generating a packet group is registered for the case where the processing instruction to be executed first is an instruction that generates a packet group consisting of a plurality of packets; and a packet generation unit that generates the packet group based on the template information registered in the template storage unit and inputs it to the input/output unit.

Подробнее
26-11-2014 дата публикации

Method for optimising the parallel processing of data on a hardware platform

Номер: EP2805234A1
Принадлежит: Thales SA

The invention relates to a method for optimising the parallel processing of data on a hardware platform comprising at least one calculation unit comprising a plurality of processing units capable of executing a plurality of executable tasks in parallel, wherein all the data to be processed is broken down into subsets of data, a same sequence of operations being carried out on each subset of data. The method of the invention comprises obtaining (50, 52) the maximum number of subsets of data to be processed by a same sequence of operations, and a maximum number of tasks that can be executed in parallel by a calculation unit of the hardware platform, determining (54) at least two processing partitions, each of said processing partitions corresponding to the partition of all the data into a number of data groups, and to the assignment of at least one executable task, capable of executing said sequence of operations, to each subset of data from said data group, and selecting (60, 62) the processing partition that makes it possible to obtain an optimal measurement value depending on a predetermined criterion. Programming code instructions implementing said selected processing partition are then obtained. One use of the method of the invention is the selection of an optimal hardware platform according to a measurement of execution performance.
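A minimal sketch of the selection step in C, under the assumption of a caller-supplied cost estimator: candidate partitions (reduced here to a group size) are enumerated up to the platform's maximum number of parallel tasks, and the partition with the best measured or estimated value is kept.

```c
/* estimate_cost() is a hypothetical measurement/estimation routine supplied
   by the caller; lower values are better according to the chosen criterion. */
int best_group_size(int n_subsets, int max_parallel_tasks,
                    double (*estimate_cost)(int group_size))
{
    int best_g = 1;
    double best_cost = estimate_cost(1);
    for (int g = 2; g <= max_parallel_tasks && g <= n_subsets; g++) {
        double c = estimate_cost(g);
        if (c < best_cost) { best_cost = c; best_g = g; }
    }
    return best_g;
}
```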

Подробнее
22-02-2017 дата публикации

Extracting system architecture in high level synthesis

Номер: CN106462431A
Принадлежит: Xilinx Inc

Extracting a system architecture in high-level synthesis includes determining a first function of a high-level programming language description and a second function contained in a control flow construct of the high-level programming description (210, 215, 220). The second function is determined to be a data-consuming function of the first function (225). Within the circuit design, a port including a local memory is automatically generated (240). The port couples, in the circuit design, a first circuit module implementing the first function and a second circuit module implementing the second function.

Подробнее
05-02-2015 дата публикации

Method and system for automated improvement of parallelism in program compilation

Номер: AU2013290313A1
Автор: Loring CRAYMER
Принадлежит: Individual

A method of program compilation to improve parallelism during the linking of the program by a compiler. The method includes converting statements of the program to canonical form, constructing an abstract syntax tree (AST) for each procedure in the program, and traversing the program to construct a graph by making each non-control-flow statement and each control structure into at least one node of the graph.

Подробнее
22-06-2016 дата публикации

Parallelization and loop optimization method and system for a high-level language of reconfigurable processor

Номер: CN105700933A
Автор: 何卫锋, 田丰硕, 绳伟光
Принадлежит: Shanghai Jiaotong University

The present invention provides a parallelization and loop optimization method and system for a high-level language for a reconfigurable processor, proposing an end-to-end language conversion system for general-purpose reconfigurable processors. For a reconfigurable processor, the core loops of compute-intensive applications must be computed in parallel by the reconfigurable part, and plain C cannot express this parallelism; the serial and parallel parts of the application therefore need to be encapsulated separately and optimized according to the characteristics of the system, finally producing a new language. When determining the data types and lengths of the inputs and outputs of a kernel function, a decls.h file is written, which simplifies the system and greatly improves its applicability. In the loop optimization process, the polyhedral model is used, making the system more widely applicable and easier to port across different architectures.

Подробнее
23-08-2018 дата публикации

Information processing device, information processing method, and information processing program

Номер: WO2018150588A1
Принадлежит: 三菱電機株式会社

A process dividing unit (130) extracts each of the one or more loop processes included in a functional model (210). A parameter extraction unit (140) determines characteristics of each extracted loop process. On the basis of the characteristics of each loop process and on the basis of a computational resource architecture for implementing the functional model (210), a performance calculation basic formula selection unit (150) selects, from among a plurality of processing time calculation procedures for calculating processing time, a processing time calculation procedure for calculating the processing time required for each loop process. A performance estimation unit (160) calculates the processing time required for each loop process using the processing time calculation procedure selected for the loop process by the performance calculation basic formula selection unit (150).

Подробнее
16-03-2018 дата публикации

Reuse of decoded instructions

Номер: CN107810477A
Автор: A·史密斯, D·C·巴格
Принадлежит: Microsoft Technology Licensing LLC

Systems and methods are disclosed for reusing fetched and decoded instructions in block-based processor architectures. In one example of the disclosed technology, a system includes a plurality of block-based processor cores and an instruction scheduler. A respective core is capable of executing one or more instruction blocks of a program. The instruction scheduler can be configured to identify a given instruction block of the program that is resident on a first processor core of the processor cores and is to be executed again. The instruction scheduler can be configured to adjust the mapping of instruction blocks in flight so that the given instruction block is re-executed on the first processor core without being re-fetched.

Подробнее
22-06-1999 дата публикации

Array summary analysis method for a loop containing a loop-exit statement

Номер: JPH11167492A
Принадлежит: HITACHI LTD

(57) [Abstract] [Problem] In a language processing system that generates an object program from a source program, to improve the precision of array summary analysis for loops containing a loop-exit statement and to improve the applicability of array privatization. [Solution] When a loop contains a loop-exit statement and a statement that assigns the value of the loop control variable at the time of exit to a scalar variable, the upper bound of the loop control variable in the array summary of the loop body is replaced with that scalar variable, the loop control variable is then eliminated from the result by variable elimination, and the result is used as the array summary of the loop; the array summary is thus computed without approximation. [Effect] The applicability of array privatization is improved for loops that contain a statement assigning the loop control variable's value at loop exit to a scalar variable.

Подробнее
05-07-2007 дата публикации

Statement shifting to increase parallelism of loops

Номер: US20070157184A1
Принадлежит: Intel Corp

A method for statement shifting to increase the parallelism of loops includes constructing a data dependence graph (DDG) to represent dependences between statements in a loop, constructing a basic equations group from the DDG, constructing a dependence equations group derived in part from the basic equations group, and determining a shifting vector for the loop from the dependence equations group, wherein the shifting vector to represent an offset to apply to each statement in the loop for statement shifting. Other embodiments are also disclosed.
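An illustrative before/after pair in C (the statements and the offset are assumptions, not taken from the patent): shifting statement S2 by one iteration turns the loop-carried dependence on S1 into an intra-iteration dependence, after which the iterations of the shifted loop are mutually independent.

```c
/* Before shifting: S2 uses a[i-1], produced by S1 in the previous iteration,
   so consecutive iterations cannot run independently. */
void before(double *a, double *b, int n)
{
    for (int i = 1; i < n; i++) {
        a[i] = 2.0 * i;         /* S1 */
        b[i] = a[i - 1] + 1.0;  /* S2: loop-carried dependence on S1 */
    }
}

/* After shifting S2 by one iteration: the only remaining dependence is within
   a single iteration, so the iterations of the new loop are independent. */
void after(double *a, double *b, int n)
{
    if (n <= 1) return;
    b[1] = a[0] + 1.0;              /* peeled first instance of S2 */
    for (int i = 1; i < n - 1; i++) {
        a[i] = 2.0 * i;             /* S1(i)                                  */
        b[i + 1] = a[i] + 1.0;      /* S2(i+1): uses a[i] from this iteration */
    }
    a[n - 1] = 2.0 * (n - 1);       /* peeled last instance of S1 */
}
```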

Подробнее
27-05-1997 дата публикации

Device and method for parallelizing compilation optimizing data transmission

Номер: US5634059A
Автор: Koji Zaiki
Принадлежит: Matsushita Electric Industrial Co Ltd

The present invention relates to an optimizing compiler apparatus for converting a source program into an object program for use by a parallel computer, which optimizes the number of data transmissions between processing elements for a parallel computer made up of a plurality of processing elements, composed of a loop retrieval unit for retrieving the loop processes from a program, a data transmission calculation unit for calculating the data transmission count generated when each of the loop processes is parallelized, a parallelization determination unit for determining the loop to be parallelized as the loop, out of all the loops in a multiple loop, with the lowest data transmission count and a code generation unit for generating parallelized object code for the determined loop. The data transmission calculation unit is made up of a right side variable retrieval unit for retrieving the variables on the right side of an equation in the loop retrieved by the loop retrieval unit, a variable information storage unit for storing information relating to array variables which should be distributed among every processing element for the part of the program which comes before the loop retrieved by the loop retrieval unit and a calculation unit for calculating the data transmission count based on the variable information for the retrieved right side variable.

Подробнее
17-01-2023 дата публикации

Systems, media, and methods for identifying loops of or implementing loops for a unit of computation

Номер: US11556357B1
Принадлежит: MathWorks Inc

Systems, media, and methods may identify loops of a unit of computation for performing operations associated with the loops. The system, media, and methods may receive textual program code that includes a unit of computation that comprises a loop (e.g., explicit/implicit loop). The unit of computation may be identified by an identifier (e.g., variable name within the textual program code, text string embedded in the unit of computation, and/or syntactical pattern that is unique within the unit of computation). A code portion and/or a section thereof may include an identifier referring to the unit of computation, where the code portion and the unit of computation may be at independent locations of each other. The systems, media, and methods may semantically identify a loop that corresponds to the identifier and perform operations on the textual program code using the code portion and/or section.

Подробнее
12-08-2021 дата публикации

Patent JPWO2021156955A1

Номер: JPWO2021156955A1
Автор: [UNK]
Принадлежит: [UNK]

Подробнее
08-05-2002 дата публикации

Predicated execution of instructions in processors

Номер: GB2363480B
Автор: Nigel Peter Topham
Принадлежит: Siroyan Ltd

Подробнее
16-02-2016 дата публикации

Efficient implementation of RSA using GPU/CPU architecture

Номер: US9262166B2
Принадлежит: Intel Corp

Various embodiments are directed to a heterogeneous processor architecture comprised of a CPU and a GPU on the same processor die. The heterogeneous processor architecture may optimize source code in a GPU compiler using vector strip mining to reduce instructions of arbitrary vector lengths into GPU supported vector lengths and loop peeling. It may be first determined that the source code is eligible for optimization if more than one machine code instruction of compiled source code under-utilizes GPU instruction bandwidth limitations. The initial vector strip mining results may be discarded and the first iteration of the inner loop body may be peeled out of the loop. The type of operands in the source code may be lowered and the peeled out inner loop body of source code may be vector strip mined again to obtain optimized source code.
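A scalar C sketch of the two transformations named above, with an assumed supported vector length VL and a simple element-wise addition standing in for the RSA kernel: the arbitrary-length operation is strip-mined into VL-wide chunks, and the first chunk is peeled out in front of the loop.

```c
#define VL 8                        /* assumed hardware-supported vector length */

void add_arrays(const unsigned *a, const unsigned *b, unsigned *c, int n)
{
    int head = n < VL ? n : VL;

    /* peeled first strip: handled before the main strip-mined loop */
    for (int j = 0; j < head; j++)
        c[j] = a[j] + b[j];

    /* strip-mined remainder: each outer iteration covers one VL-wide strip */
    for (int i = head; i < n; i += VL) {
        int len = (n - i) < VL ? (n - i) : VL;
        for (int j = 0; j < len; j++)
            c[i + j] = a[i + j] + b[i + j];
    }
}
```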

Подробнее