2008-08-24

Object **arr;

What? What do these asterisks do? We're having Ruby here, and the whole old pointers things from C++ are gone, aren't they?

Well, they are not, sorry. Remember one thing, please. This pointers thing is not a part of C++ or any other particular language. Pointers are how computers work. Each (reasonable) programming language either has pointers, or is inefficient. Ruby has them too. The difference is that in Ruby we don't use asterisks to denote we're using them.

What's "wrong"

Have a look at the example that proves we have pointers in Ruby:
a=[2,3,5,7]   #=> [2, 3, 5, 7]
b=a #=> [2, 3, 5, 7]
a<<11 #=> [2, 3, 5, 7, 11]
a #=> [2, 3, 5, 7, 11]
b #=> [2, 3, 5, 7, 11]
So, as you see,
a
and
b
are just pointers. And when we make the substitution in the second line, we just make them point to the same object in memory (array, in our example), so when we modify the object using an in place changing method (like
reverse!
,
clear
and others), the change will be also visible through the other variable.

We observe exactly the same behaviour when we use a string instead of an array:
a="abc"       #=> "abc"
b=a #=> "abc"
a<<"x" #=> "abcx"
a #=> "abcx"
b #=> "abcx"
Note that if you use
+=
instead of
<<
, a new instance of the string with the
"x"
appended is created and assigned to
a
, so if you want to make a string buffer and append to it some lines, it's probably better to use
<<
, because it does not create another object.

So, what if we would like to have an independent copy of a given array or a string? Ruby comes with a function
dup
that makes a copy of an object. (In fact there's also a function
clone
that behaves similarily, for today let's assume the functions do exactly the same (they don't), and let's use
dup
.) Change the line
b=a
to
b=a.dup
in both examples above and you will see it works like expected - modifying
a
does not modify
b
and vice versa.

Happy? So, is that all for today? Not quite. Have a look:
a=["a","b"]
b=a.dup
a<<"c"
b #=> ["a", "b"] # as expected
a[0]<<"x"
a             #=> ["ax", "b", "c"]
b #=> ["ax", "b"] # oops!
What happened? Well, now
b
is a copy of
a
, which means they are two separate arrays, so when we add
"c"
to one of them, the other does not get modified, we already know that. But in Ruby everything is an
Object
, so the elements of the arrays are objects too, and, what worse, they are the same objects. When we did
b=a.dup
, we created a separate array, but the elements of the array are pointers to the same strings as the elements of the original array. Our copy is not deep, we separated the top-level objects but not the elements of the array. So when we modified in place one of the elements, it got modified also in the other array, even though the arrays are separate objects.

Another way to check what happens:
x=Object::new       #=> #<Object:0x2d8ce10>
x.dup #=> #<Object:0x2d91dfc>
[x] #=> [#<Object:0x2d8ce10>]
[x].dup #=> [#<Object:0x2d8ce10>]
As you see,
dup
on an object makes a new object (the addresses differ), but
dup
on an array makes a new array but does not make a copy of the elements - it just makes a new array with its elements pointing to the original objects.

How to "fix" it
OK, so let's change the behaviour of
dup
for
Array
so that it makes a deep copy calling
dup
recursively on all its elements. Good idea? Yes, but NO NO NO!

Never do such a thing as changing the behaviour of a standard function! Someone can depend on how it works now, so you cannot change it! Remember well this lesson! Even if you think it's broken, don't fix it in this way!

Sorry for shouting, but it is important. Let's define a new function with the functionality we've just described. Here's how we do it:
class Object
def deep_dup
dup
end
end

class Array
def deep_dup
map{|e| e.deep_dup}
end
end

class Hash
def deep_dup
h={}
each{|k,v| h[k.deep_dup]=v.deep_dup}
h
end
end
First we define
deep_dup
for "normal" object as a simple copy, as they do not need any special treatment. Then we redefine (override) the function for
Array
and also for
Hash
, as there's exactly the same case with hashes as with arrays. You can check that after changing
dup
to
deep_dup
in the examples above, everything will work as expected.

Of course there are also other classes like
Set
(available after
require 'set'
) that might need overriding
deep_dup
for them to work.

So why use dup?
Now another lesson: why at all use this strange-working
dup
if we have such a nice
deep_dup
? Well, the answer is simple: as
deep_dup
copies everything, it might use much more memory (and time) than
dup
. That's why I said that if a language doesn't have pointers (that is, if a normal substitution works just like our
deep_dup
), it is inefficient. Because the solution is not to stop using
dup
. The solution is to use it carefully and wisely. And to remember, that there are pointers under the nice skin of Ruby.

Other methods
If we need a deep copy, the approach described above is probably the best one, but not the only one. One of the most secure methods to create a complete deep copy of an object is to serialise it to a "soul-less" string and deserialise it back. Then we can be absolutely sure that no part of the new object will be a part of the original one, as deserialisation for sure creates the object from scratch.

Ruby provides two easy serialisation methods:
YAML
and
Marshal
.
require 'yaml'

class Object

def m_dup
Marshal.load(Marshal.dump(self))
end

def y_dup
YAML.load(YAML.dump(self))
end

end
Both
YAML
and
Marshal
have a method
dump
that returns a string representation of the object passed (you can see how the strings look like by calling
dump
on various objects in irb), and the method
load
that does the reverse.

Differences:
- as you can see,
Marshal
's string is shorter so probably it's better and more efficient.
-
YAML
does not always work. I don't know why this is happening, but some complicated structures with sets, arrays, strings and hashes fail to load from the dumped string.

Both methods are most probably worse (slower) than our
deep_dup
, because they need to parse the string.

Last problem - self-references
There's one point, however, where the serialisation method works, and our
deep_dup
fails. It is when an array is an element of itself:
a=[]
a<<a
a #=> [[...]] # the three dots denote a self-reference
a.deep_dup
SystemStackError: stack level too deep
from (irb):87:in `deep_dup'
from (irb):87:in `deep_dup'
from (irb):87:in `map'
from (irb):87:in `deep_dup'
from (irb):87:in `deep_dup'
from (irb):87:in `map'
from (irb):87:in `deep_dup'
from (irb):87:in `deep_dup'
from (irb):87:in `map'
from (irb):87:in `deep_dup'
from (irb):87:in `deep_dup'
from (irb):87:in `map'
from (irb):87:in `deep_dup'
from (irb):87:in `deep_dup'
from (irb):87:in `map'
from (irb):87:in `deep_dup'
... 15577 levels...
from (irb):87:in `map'
from (irb):87:in `deep_dup'
from (irb):87:in `deep_dup'
from (irb):87:in `map'
from (irb):87:in `deep_dup'
from (irb):87:in `deep_dup'
from (irb):87:in `map'
from (irb):87:in `deep_dup'
from (irb):87:in `deep_dup'
from (irb):87:in `map'
from (irb):87:in `deep_dup'
from (irb):87:in `deep_dup'
from (irb):87:in `map'
from (irb):87:in `deep_dup'
from (irb):115
from (null):0

Aww, a failure, because duplicating
a
needs duplicating
a
first, and so on. That's why our method is not so perfect, and will fail also on examples like this:
a=[1,2,{:a=>4}]
a[2][:b]=a
a #=> [1, 2, {:a=>4, :b=>[...]}]
It can be fixed and handled just as it is handled in serialisation (it works without problems with such objects), and also in
inspect
(if it wasn't, you'd have an infinite output after creating such an object in irb), but I'll leave this as an exercise for the reader.

One more method
This last method is so bad that you should never use it, I just mention it to give you another knol to think about.
class Object
def i_dup
eval(inspect)
end
end
It calls the method
inspect
to create a human-readable representation of the object, just like irb does after each command, and then passes the string to
eval
that simply executes the string.

You should know by yourself why this method is bad, but just to make it clear:
- it only works for objects "made of"
Array
,
Hash
,
Numeric
,
String
,
Range
and
Symbol
instances (maybe some more I forgot about now), it won't work for
Object::new
,
- it doesn't handle self-reference (it fails when it sees the three dots).

That's all for today, I hope you learnt something new.

No comments: