Is it possible to avoid copy-on-write behavior in functions yet?

Christopher on 3 Oct 2017
Commented: Tyler Warner on 24 May 2018
As I understand it, MATLAB uses a mechanism called 'copy-on-write' for function calls. So if you have a function of the form
function [out] = myfunction(out,in1,in2)
in1 = rand(1);
in2 = rand(1);
out = in1+in2;
MATLAB will create a new space in memory for a new copy of variables out, in1, and in2, perform the given operations on these arrays, and then copy modified arrays onto the old variable memory space if it is an output variable. This will also occur for the variable 'out', and will even occur for in1 and in2 if written as
function [out,in1,in2] = myfunction(out,in1,in2)
in1 = rand(1);
in2 = rand(1);
out = in1+in2;
Obviously, this behavior wastes time if you know that the old variable should be replaced by the new variable. I have long avoided using functions for this reason, resulting in messy code.
Is it possible to pass variables to functions by reference? If not, will this be possible in a future MATLAB release?
EDIT:
A commenter noted that since the inputs in1 and in2 are defined in the function they do not need to be passed through the function. Perhaps the following better describes the problem:
function [out,ind1,ind2] = myfunction(out,in1,in2)
ind = 5;
in1(ind) = rand(1);
in2(ind) = rand(1);
out = in1+in2;
so the function modifies one element of each of these arrays, although the entire variables are copied before being modified.
10 Comments
Guillaume on 4 Oct 2017
With regards to the latest edit, and assuming that ind1 and ind2 are meant to be in1 and in2 then, according to Loren's blog linked in Jan's answer, no copy is made.
But as I said in my answer, whether or not it does should be way down the list of priorities until it's been proven to be a bottleneck.
James Tursa on 4 Oct 2017
"... no copy is made ..."
Sort of. No copy is made IF this function is called from within another function, and IF the calling routine uses syntax where the input and output variables match, and IF the original variables are not shared data copies of something else to begin with. If any of those conditions is not met, then a data copy will be made.
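The calling pattern described in this comment can be sketched as follows. The function names demoInPlace and bumpElement are made up for illustration, and whether MATLAB really updates in place depends on the release and on the conditions listed above, so the comments describe what one would expect to observe (for example by watching memory use), not a guarantee.

```matlab
function [x, y, z] = demoInPlace()
    % Hypothetical driver: the calls are made from inside a function,
    % which is one of the conditions named in the comment above.
    x = zeros(1, 1e6);       % fresh array, not a shared copy of anything else

    x = bumpElement(x, 5);   % same variable as input and output: eligible for in-place update

    y = x;                   % y now shares its data with x
    y = bumpElement(y, 5);   % y is a shared data copy, so a data copy is expected here

    z = bumpElement(x, 5);   % output variable differs from the input: a data copy is expected
end

function x = bumpElement(x, ind)
    % Modifies one element of the input and returns it under the same name.
    x(ind) = x(ind) + 1;
end
```

Only the first call to bumpElement meets all three conditions; the other two are there to show the patterns that break them.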


Answers (3)

Guillaume on 3 Oct 2017
First, you've fallen into the trap of premature optimisation. You've decided not to use functions because they may slow your code, but you don't know that for sure (in all likelihood it's the opposite: it's easier for MATLAB's JIT compiler to optimise functions), and instead you've ended up with messy code.
So really, the answer to your question is: stop worrying about the internal implementation of MATLAB until you've proven it is an issue by profiling your code. Bear in mind that the internal implementation is not fully documented and is subject to change from version to version.
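One way to check whether a function call is really the bottleneck is to measure it with timeit and the profiler. This is a minimal sketch under assumptions: the anonymous function addNoise and the array size are placeholders standing in for your own code, not anything from the question.

```matlab
% Hypothetical workload: addNoise stands in for the function you suspect is slow.
addNoise = @(a) a + rand(size(a));
x = zeros(1000);

% timeit calls the function handle several times and returns a typical time in seconds.
tCall = timeit(@() addNoise(x));
fprintf('One call takes about %.3g s\n', tCall);

% Profile a short loop to see where the time is actually spent.
profile on
for k = 1:50
    x = addNoise(x);
end
profile off
stats = profile('info');   % struct whose FunctionTable field lists the profiled functions
% profile viewer           % uncomment to open the interactive report
```

If the suspected function does not show up near the top of the report, restructuring the code around it is unlikely to pay off.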
Secondly, you've misunderstood copy-on-write. In your example, copy-on-write is never triggered for any of the variables. Brand new variables are created, no copying occurs. Copy-on-write is triggered when you're modifying part of a variable but still have the original in another variable:
a = [1 2 3];
a(2) = 4; %no copy-on-write
a = [1 2 3];
b = a;
b(2) = 4; %copy-on-write triggered since original still in a
a = [1 2 3];
b = a;
b = [1 4 3]; %no copy-on-write since b is a different variable altogether (your example)
As for reusing the same memory when input and output are the same variable, I believe MATLAB's JIT compiler does that, but again, we're talking about implementation details that should not matter much and are subject to change.
1 Comment
Jan on 3 Oct 2017
+1: "trap of premature optimisation". Christopher, read this carefully. You got a lot of very valuable suggestions in this thread.



Cedric on 3 Oct 2017
Edited: Cedric on 5 Oct 2017
I agree with most of what is said in the comments/answers. Yet, if you really needed to avoid copies for good reasons in a context far more complex and/or specific than the example that you give, you could create a handle class and always work on a single copy of whatever you pass to functions/methods.
Again, there is no point in doing this for simple data structures unless you have proven that you cannot afford the copy-on-write, so don't jump on this solution if you don't fully understand what you are doing.
Yet, skimming the history of your questions, I think that you know what you are doing and that people reacted to your comment about "not using functions and getting messy code for avoiding copy-on-write" a bit too quickly .. but you have to admit that in most cases this is almost a heretical statement/approach ;-)
Anyhow, assuming that you need this for valid reasons, here is an example:
classdef VeryVeryLargeArray < handle
    properties
        array
    end
    methods
        function obj = VeryVeryLargeArray( builder, varargin )
            obj.array = builder( varargin{:} ) ;
        end
        % Possibly some overload of e.g. SUBSREF/SUBSASGN/SIZE and operators.
    end
end
Using it for building e.g. a 5GB random array (so you can see something in the task manager):
>> n = floor( sqrt( 5e9/8 )) ;
>> vvla = VeryVeryLargeArray( @rand, n ) ;
you see a 5GB jump in the memory usage. Now if you call a function e.g. setRow :
function setRow( vvla, rowId, value )
    vvla.array(rowId,:) = value ;
end
after having set a breakpoint on the third line (the one with end):
setRow( vvla, 1, 0 ) ;
you won't see a second jump due to a copy-on-write and your array will have been updated (even in the base workspace, because handles work "a bit like pointers").
EDIT 10/4 @ 12:41UTC: I am just giving you a quick example of overload of SUBSREF in case you wanted to transfer block indexing of the object(s) to the internal array(s):
function out = subsref( obj, S )
    if S(1).type(1) ~= '.'
        out = subsref( obj.array, S ) ;
    else
        out = builtin( 'subsref', obj, S ) ;
    end
end
This method could be added after my comment in the methods block of the class definition. The same would have to be done for SUBSASGN and possibly SIZE. The advantage is that most functions could operate on the object the way they operate on any numeric array:
>> vvla(2:4, 10:13)
ans =
0.5108 0.1707 0.3188 0.3955
0.8176 0.2277 0.4242 0.3674
0.7948 0.4357 0.5079 0.9880
This accesses vvla.array(2:4,10:13) and has the advantage of making the internal structure transparent to the user (at least for what is managed by SUBSREF).
Note that testing S(1).type(1)~='.' (and not just S(1).type(1)=='(') forwards any () or {} indexing to the array property, so you can use builders of cell arrays:
vvlca = VeryVeryLargeArray( @cell, 4, 5 ) ;
BUT you cannot easily (or at all) properly manage comma-separated list (CSL) outputs (especially when you want to nest these objects), so there is a limit to what you can achieve with overloading indexing methods. [If you try, you will likely spend hours wondering why nargout is defined through a call to your overloaded NUMEL and not to the builtin, and trying to find workarounds.]
EDIT 10/5 @ 12:32UTC: As mentioned, you can overload specific operations or functions that are relevant to the use that you make of these arrays. If you want to be able to use DIFF transparently for example:
function df = diff( obj, varargin )
    df = diff( obj.array, varargin{:} ) ;
end
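
The answer says that the same forwarding would have to be done for SUBSASGN but does not show it. A minimal sketch of what that could look like is given here, with the class repeated so the block is self-contained. The subsasgn method is an assumption that simply mirrors the SUBSREF overload from the answer; it is not code from the original post, and whether the forwarded assignment avoids a temporary copy of the array is not established in this thread and would need checking (for example with the task manager, as described above).

```matlab
classdef VeryVeryLargeArray < handle
    properties
        array
    end
    methods
        function obj = VeryVeryLargeArray( builder, varargin )
            obj.array = builder( varargin{:} ) ;
        end
        function out = subsref( obj, S )
            % As in the answer: forward () and {} indexing to the array property.
            if S(1).type(1) ~= '.'
                out = subsref( obj.array, S ) ;
            else
                out = builtin( 'subsref', obj, S ) ;
            end
        end
        function obj = subsasgn( obj, S, value )
            % Assumed counterpart (not in the original answer): forward () and {}
            % assignments to the array property, leave dot assignments to the builtin.
            if S(1).type(1) ~= '.'
                obj.array = subsasgn( obj.array, S, value ) ;
            else
                obj = builtin( 'subsasgn', obj, S, value ) ;
            end
        end
    end
end
```

With this in place, an assignment such as v(2,2) = 7 on an object built with VeryVeryLargeArray( @zeros, 3 ) is routed to v.array(2,2), while v.array = ... still goes through the builtin.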

Jan on 3 Oct 2017
Edited: Jan on 3 Oct 2017
You are right: when the algorithm is very efficient and runs on a multi-core machine, the memory copies can become the bottleneck. I had the same problem in an optimization tool written in C, which called a FORTRAN library for solving a huge matrix equation with a known pattern. The two deep data copies when entering and leaving the library took 40% of the total run time. Fortunately we had the FORTRAN source code and modified it to process the matrices in-place.
But now imagine we had avoided using functions in the first place. As you wrote, the code would have been too messy to optimize.
You can avoid deep data copies sometimes:
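x = zeros(10000, 10000);
n = 1e6;

% Variant 1: the subfunction modifies the array and returns it.
tic;
for k = 1:n
    x = addInSubFcn(x);
end
toc

% Variant 2: the subfunction returns only the new value and its index;
% the caller writes it into the array.
tic;
for k = 1:n
    [xx, index] = addInCaller(x);
    x(index) = xx;
end
toc

function x = addInSubFcn(x)
index = randi(numel(x));
x(index) = x(index) + rand;
end

function [xx, index] = addInCaller(x)
index = randi(numel(x));
xx = x(index) + rand;
end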
R2016b/64, Win7:
Elapsed time is 2.583763 seconds. % In subfunction
Elapsed time is 1.884192 seconds. % In caller
Keep this in mind when you create functions to modify arrays.
