At each
time step, the network received a reward r(t) = 1 if the
mouth moved closer to the food source, or r(t) = 0.1 if the
mouth moved farther.
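As a minimal sketch of this reward rule, the function below assumes 2-D positions and Euclidean distance. The text does not say how distance is measured. It returns 1.0 when the mouth-to-food distance shrinks between consecutive time steps and 0.1 otherwise. The function name `reward` and its arguments are hypothetical. The text also does not cover the case where the distance is unchanged, so treating that case like "farther" is our assumption.

```python
import math


def reward(prev_mouth, mouth, food):
    """Hypothetical reward rule: 1.0 if the mouth got closer to the food, else 0.1.

    Positions are (x, y) tuples. Distance is assumed to be Euclidean.
    A step with no change in distance is assumed to count as "not closer"
    (the source text does not specify this case).
    """
    d_prev = math.dist(prev_mouth, food)  # distance at time t - 1
    d_now = math.dist(mouth, food)        # distance at time t
    return 1.0 if d_now < d_prev else 0.1


# Mouth moves from x=0 to x=1 toward food at x=5: closer, so reward 1.0.
print(reward((0.0, 0.0), (1.0, 0.0), (5.0, 0.0)))
# Mouth moves from x=1 back to x=0: farther, so reward 0.1.
print(reward((1.0, 0.0), (0.0, 0.0), (5.0, 0.0)))
```

Under this rule, moving away from the food still earns a small positive reward rather than a penalty. Only the relative size of the two rewards distinguishes good steps from bad ones.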