I have developed a DDPG model which optimizes traffic at intersections along one direction. I am now looking to implement four copies of the same model, one per direction (North-South, South-North, East-West, and West-East), i.e. I would like to run 4 DDPG models simultaneously, each with its own local reward function. I have attempted to combine all 4 approaches, but unfortunately the model appears to confuse actions in one direction with observations in another.
For example, if the agent sends a signal to a vehicle in the east-west lane to change its speed while simultaneously doing the same for another vehicle in the north-south direction, the system considers the sum of the rewards for all actions performed, so optimal actions on one approach end up overshadowed by subpar actions on another.
It is for this reason that I believe a collaborative multi-agent approach may be ideal, but I cannot find anything in the MATLAB documentation indicating how this may be done beyond very simple Simulink examples. I have noted the following points, which still leave significant gaps:
My current model uses a custom environment which interfaces with another software's COM server in order to generate a sample environment from which observations are taken and actions are applied. I am not using Simulink because of the need for this external traffic simulation software. My current system involves an rlNumericSpec observation space with 10 variables and a continuous action space which performs 2 actions.
I would like to simultaneously run 4 of the same DDPG agents (or other actor-critic models if necessary), each with its own independent reward and action space. Is this possible with the Reinforcement Learning Toolbox as of 2020, and if so, how may one approach it? More specifically:
  • How would one specify the 4 different sets of observations/actions, and how would this be done in the same custom constructor function? Each observation spec is of the form rlNumericSpec([10 1]), for a total of 40 observations, and each action spec is of the form rlNumericSpec([2 1],'LowerLimit',[20;20],'UpperLimit',[40;40]), for a combined action space of [8 1]. I have tried following the example "Train Multiple Agents for Path Following Control" (mathworks.com) for the actInfo and obsInfo syntax, i.e. obsinfo = {obsinfo1, obsinfo2, ...}, which thus far has returned an error.
  • Once the model is running, how would said actions be applied to the custom environment? Would they simply appear as Action1(), Action2(), etc.?
  • How would the individual localized reward functions be set within the step function? By default, for a single agent, the reward is simply stored as "Reward"; is there a form in which the rewards would be split into Reward_agent1, Reward_agent2, etc.?
  • Is it an absolute must to use Simulink, or can this be done with my existing custom environment setup?
  • Are there any additional resources that may help me achieve this that I may have missed?
I understand that this is quite a large question, but I hope it will also help others looking to use this software for more complex multi-agent applications without Simulink. Thank you in advance for your assistance.

 Accepted Answer

Emmanouil Tzorakoleftherakis
Edited: Emmanouil Tzorakoleftherakis on 11 Dec 2020


Hello,
As you noticed, as of R2020b we support (decentralized) multi-agent RL but only in Simulink. We are looking to expand this to more centralized multi-agent approaches in future releases, potentially outside of Simulink (i.e. in MATLAB) as well.
One workaround would be to convert your MATLAB-based environment into a Simulink one using the MATLAB function block. That would allow you to use multi-agent training in 20b and refer to the example links you posted.
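For reference, the multi-agent workflow added in R2020b takes one spec pair and one agent block path per agent, passed as cell/string arrays to rlSimulinkEnv. A minimal sketch (the model name 'trafficMdl' and block names are placeholders; spec sizes are taken from your description):

```matlab
% Sketch: one observation/action spec pair per agent block.
% Each agent sees 10 observations and outputs 2 bounded continuous actions.
obsInfo = cell(1,4);
actInfo = cell(1,4);
for k = 1:4
    obsInfo{k} = rlNumericSpec([10 1]);
    actInfo{k} = rlNumericSpec([2 1],'LowerLimit',[20;20],'UpperLimit',[40;40]);
end
% Paths to the four RL Agent blocks inside the (hypothetical) model
blks = ["trafficMdl/Agent NS","trafficMdl/Agent SN", ...
        "trafficMdl/Agent EW","trafficMdl/Agent WE"];
env = rlSimulinkEnv("trafficMdl", blks, obsInfo, actInfo);
% Training then takes an array of agents, one per block:
% trainStats = train([agentNS, agentSN, agentEW, agentWE], env, trainOpts);
```

This matches the syntax used in the path-following multi-agent example you linked; the cell-array-of-specs form is specific to the Simulink environment constructor, which is why it errors with rlFunctionEnv.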
Another workaround is to combine all the observations and actions into a single DDPG agent. That way you would be able to use a MATLAB environment (is this what you meant when you said that you combined the 4 approaches?). As you found out though, decentralized multi-agent training comes with challenges, particularly because it leads to non-stationary environments. I don't know how you have set up your problem, but each agent will need to be aware of what every other agent is doing and vice versa. So all previous actions will need to show up as observations for example. That may resolve the situation you described where optimum actions are overshadowed by subpar ones (although the individual subrewards will need to also be properly scaled).
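The centralized single-agent workaround could be sketched like this, stacking the four approaches' spaces (limits copied from your per-approach action spec; StepHandle/ResetHandle are the handles from your existing environment):

```matlab
% Sketch: one agent observes and controls all four approaches at once.
% 4 x 12 observations (10 state variables + the 2 previous actions per approach)
obsInfo = rlNumericSpec([48 1]);
% 4 x 2 actions, each bounded like the per-approach spec
actInfo = rlNumericSpec([8 1], ...
    'LowerLimit', repmat([20;20],4,1), ...
    'UpperLimit', repmat([40;40],4,1));
env = rlFunctionEnv(obsInfo, actInfo, StepHandle, ResetHandle);
% Inside the step function the reward becomes a scaled sum of local rewards,
% with weights w chosen so that no single approach dominates:
% Reward = w(1)*rNS + w(2)*rSN + w(3)*rEW + w(4)*rWE;
```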
You may also want to look at this example which sounds similar to what you are trying to do.
Hope that helps

5 Comments

Tarek Ghoul
Tarek Ghoul on 12 Dec 2020
Hello Emmanouil, thank you for the quick response.
>Another workaround is to combine all the observations and actions into a single DDPG agent. That way you would be able to use a MATLAB environment (is this what you meant when you said that you combined the 4 approaches?). As you found out though, decentralized multi-agent training comes with challenges, particularly because it leads to non-stationary environments.
This is exactly what I was referring to, as I was unable to "localize" the 4 action-reward pairs through a single agent. I can try inputting the previous actions as part of the observations to increase the total number of observations from 4x[10 1] to 4x[12 1] (accounting for both previous actions), although I am not quite sure what you mean by "properly scaling individual subrewards".
The reward is calculated from a performance metric found in the literature relating to delay, which is calculated with timesteps of dt=10s (in the external simulation space). Does "properly scaling individual subrewards" refer to the total reward calculated within the body of the step function, or is there an alternative way to partition the reward between the step function and the reset function that I am missing?
>One workaround would be to convert your MATLAB-based environment into a Simulink one using the MATLAB function block. That would allow you to use multi-agent training in 20b and refer to the example links you posted.
I have attempted to do this but was unsure of how to represent the environment. The current MATLAB setup that I am trying to convert involves creating custom function handles in order to pass a variable called "Vissim", which is an actxserver() COM interface that connects to an external simulation software. The following is my constructor function with the current 1-agent setup:
ResetHandle = @()MyResetFunction(Vissim);
StepHandle = @(Action,LoggedSignals) MyStepFunction(Action,LoggedSignals,Vissim);
env = rlFunctionEnv(ObservationInfo,ActionInfo,StepHandle,ResetHandle);
The constructor sets up the external simulation COM interface, the reset function loads the specific intersection/roadway, and the step function is evaluated in MATLAB increments of 1, with the external software (Vissim) running for 10 external (non-MATLAB) simulation seconds before outputting the observations that are used to generate a reward. Once a certain number of steps have elapsed, the reset function reloads the intersection to its initial state. The Vissim object, i.e. the COM interface, is passed via the custom step handle from step to step to avoid the simulation resetting each time a step is performed. Hopefully this explains the problem a little better.
Would Simulink be able to do something like this with regard to inputs and outputs that are COM objects rather than simple scalar values? Furthermore, is there anything in the MATLAB documentation that could be helpful for the switch between MATLAB and Simulink?
I am mainly wondering how the replaced environment block could be represented by a single function. Would it entail having subfunctions labelled constructor, step, and reset? I can see from the examples that the output of the typical environment involves "reward, isdone, and obsinfo" as block output signals, but there is no indication in the documentation of how to set up entire environments as unified MATLAB functions.
Thank you again for your help, this has been extremely useful
By scaling subrewards I just meant that oftentimes the reward signal consists of different components added together (presumably from the different "agents"), so I was literally referring to properly scaling the different terms that are added together.
There is no example that shows how to go from a MATLAB environment to a Simulink one, but going through a simple Simulink example like this one should be helpful. In the attached image from this example you can see that you pretty much need 1) the RL Agent block, 2) the dynamics/step function, which can be copied and pasted from what you have now into a separate MATLAB Function block, and 3) the reward, which can be its own MATLAB Fcn block or can be included in the same one as the "step".
The one thing that may be a little tricky is how to pass that Vissim object you mentioned in the MATLAB Fcn block. The difficulties arise because the MATLAB Fcn block generates C code under the hood to run so anything that's included should support code generation. The workaround would be to use 'coder.extrinsic' to call the interpreter instead.
So bottom line what I would suggest without knowing all the details:
Create a MATLAB function 'myfun' that either creates or loads the Vissim object from the workspace. You can make it a persistent variable to avoid doing that every time. After loading, copy and paste over your dynamics from the step function. Now, inside the MATLAB Function block, start with 'coder.extrinsic('myfun')' to avoid any codegen issues, and after that call 'myfun'.
That should give you the general idea.
Tarek Ghoul
Tarek Ghoul on 12 Dec 2020
Thank you very much for your help Emmanouil!
Hello again Emmanouil,
After taking your advice I have attempted to transfer the code from standard MATLAB to Simulink. I have spent several hours reading the documentation and taking the Onramp course available online. I am still having difficulty loading Vissim into Simulink, as I keep getting errors.
In the original model, the Vissim object was loaded in a one-time-use constructor using Vissim = actxserver('Vissim.Vissim.700'), which also passed Vissim on to the reset and step handles as I mentioned earlier. In this model, I cannot seem to get Vissim to act as an input to the custom MATLAB step/environment Function block.
As you recommended:
>Create a MATLAB function 'myfun' that either creates or loads the vissim object from workspace. You can make that a persistent variable to avoid doing that all the time.
I have created the following code at the workspace level to import the Vissim object as a persistent variable. While the object Vissim persists in other files if accessed, this does not seem to be the case for the simulated Simulink environment.
coder.extrinsic('StartVissim2')
makepersist()
[Vissim] = StartVissim2()
function [test] = makepersist()
persistent Vissim
persistent vnet
persistent sim
end
function [Vissim] = StartVissim2()
Vissim = actxserver('Vissim.Vissim.700');
Vissim.LoadNet('D:\User\Vissim\testnet\testnetdiscrete.inpx');
vnet=Vissim.net
sim = Vissim.sim
mlock
end
After running the above, any reference to Vissim via dot indexing is undefined, returning "Attempt to extract field 'Simulation/Net/etc.'" errors. It appears that Vissim itself is stored as an "mxArray", but any attempt to call it yields an error.
For reference, my environment/step function is as follows (with irrelevant/trivial portions removed for readability):
function [rewardA, rewardB,isDone,LoggedSignals,observation] = stepfunction(LoggedSignals,actionA,actionB,observation)
coder.extrinsic('StartVissim2')
TR = 0; % test variable to get it to run a function just once; will be modified once working
if TR <1
[Vissim] = StartVissim2()
end
%sets up dot indexed variables to be called
sim = Vissim.Simulation;
vnet = Vissim.Net;
%gets logged signals
State = LoggedSignals.State;
% Unpack state vector from previous step
n1 = LoggedSignals.State(1);
n2 = LoggedSignals.State(2);
n3 = LoggedSignals.State(3);
n4= LoggedSignals.State(4);
asp1= LoggedSignals.State(5);
asp2= LoggedSignals.State(6);
asp3= LoggedSignals.State(7);
asp4= LoggedSignals.State(8);
PGap= LoggedSignals.State(9);
TSG= LoggedSignals.State(10);
TSR= LoggedSignals.State(11);
%Stores and executes actions using function (inputs are actionA/actionB rather than a single Action vector)
VP1 = actionA;
VP2 = actionB;
ApplyAction(Vissim,vnet,LoggedSignals,VP1,VP2,PlatoonIDmat1,PlatoonIDmat2,TSG,TSR,TSGmax,TSRmax);
%Function which runs 10 vissim timesteps using the 'Vissim.simulation.RunSingleStep'
%command while obtaining vehicle data from Vissim and interpreting it
[Outputs_cycle_parameters] = Generate_CV_and_NCV_Matrices(Vissim,sim,vnet,LoggedSignals);
%extracts relevant reward and observation parameters from Outputs_cycle_parameters
ConfM2= Outputs_cycle_parameters(8);
TSGmax = Outputs_cycle_parameters(13);
TSRmax = Outputs_cycle_parameters(14);
TSG = Outputs_cycle_parameters(15);
TSR = Outputs_cycle_parameters(16);
%obtains observation data now that actions have been performed and the reward is obtained
[n1, n2, n3,n4,asp1,asp2,asp3,asp4] = getns(Vissim,sim,vnet);
[PlatoonIDmat1 PlatoonIDmat2 PGap] = GetPlatoons(Vissim, vnet,LoggedSignals);
%sets LoggedSignal State as well as a .Pass that allows for the maximum Green and Red times
%to be recorded and saved for the next timestep without being considered observations
LoggedSignals.State(1)=n1;
LoggedSignals.State(2)=n2;
LoggedSignals.State(3)=n3;
LoggedSignals.State(4)=n4;
LoggedSignals.State(5)=asp1;
LoggedSignals.State(6)=asp2;
LoggedSignals.State(7)=asp3;
LoggedSignals.State(8)=asp4;
LoggedSignals.State(9) = PGap;
LoggedSignals.State(10)=TSG;
LoggedSignals.State(11)=TSR;
LoggedSignals.Pass(1) = TSGmax;
LoggedSignals.Pass(2) = TSRmax;
%Defining the observation (name must match the declared output 'observation')
observation = LoggedSignals.State;
% Update system states
NextObs = LoggedSignals.State;
% Check terminal condition (no reasonable "done" condition due to Q0); this is governed by steps
isDone = 0;
%Defines the reward based on the previous value (rewardB computed analogously; elided here)
rewardA = -ConfM2;
end
function [Outputs_cycle_parameters]= Generate_CV_and_NCV_Matrices(Vissim,sim,vnet,LoggedSignals)
%Function Which runs the simulation using a for loop with sim.RunSingleStep and several
%other simple matlab computations. To avoid innundating this with code, the most relevant
%portions with regards to the simulink problem are:
get(Vissim.Net.SignalHeads.ItemByKey(15), 'AttValue', 'State'); %obtains signal states from the Vissim COM interface
speedmat = Vissim.Net.Vehicles.GetMultiAttValues('Speed'); %obtains speeds using the GetMultiAttValues method
posmat = Vissim.Net.Vehicles.GetMultiAttValues('Pos'); %obtains positions using the GetMultiAttValues method
%the remaining code simply manipulates this to obtain traffic parameters for the observation space and collects
% all speeds/positions/types in one big matrix
end
function [n1, n2, n3,n4,asp1,asp2,asp3,asp4] = getns(Vissim,sim,vnet)
%simple function to obtain the vehicle counts and average speeds of vehicles in 4 segments of an approach
%for the observation
end
function [PlatoonIDmat1, PlatoonIDmat2, PGap] = GetPlatoons(Vissim, vnet, LoggedSignals)
%Simple matrix manipulation to identify clusters of vehicles using the similar getmultiattvalues method
end
function ApplyAction(Vissim,vnet,LoggedSignals,VP1,VP2,PlatoonIDmat1,PlatoonIDmat2,TSGm,TSRm,TSGmax,TSRmax,App_Number)
%applies speeds to vehicles and to objects in the Vissim Simulation using both conditional
% statements and the speeds obtained by the action with the following COM related syntax:
vnet.DesSpeedDecision.ItemByKey(2+5*(App_Number-1)).set('AttValue','DesSpeedDistr(70)',VP2);
vnet.Vehicles.ItemByKey(PlatoonIDmat2(n,1)).set('AttValue','DesSpeed',VP1)
end
%attempt to call the function so that it loads Vissim within the environment.
%Unfortunately this does not work
function [Vissim] = StartVissim2()
Vissim = actxserver('Vissim.Vissim.700');
Vissim.LoadNet('D:\User\Vissim\testnet\testnetdiscrete.inpx');
mlock
end
If I understood correctly, the above code is what is necessary (beyond the COM interface) for the following structure to be used (assuming 2 agents): effectively the same as the other examples, with the inputs being the actions and the previous state via logged signals and observations, and the outputs being the rewards, observation, isdone, and logged signals. For the time being I am testing the step function independently of the other systems to ensure that it works beforehand.
I hope that this explains the situation. With this in mind, how might one go about modifying the code with regard to the extrinsic functions and persistent variables to get the COM interface to stick, and to be able to call its variables within the Simulink custom function?
Thank you again for all of your help and your quick responses. I appreciate the time that you have spent helping me and others like me with learning how to use this powerful software.
Hi again Tarek,
No problem, I try to make time to help out every now and then given that this is not my dayjob.
I think your setup is in the right direction. Here is where I believe the problem is: Vissim is an object that MATLAB cannot directly recognize/generate code from. Converting objects to their C code equivalents is necessary, particularly when you are using these objects as inputs/outputs of functions, as you are doing. My recommendation is to encapsulate every single function where Vissim is needed as an input/output in a single function that only inputs/outputs variables that can be directly read by MATLAB. So something like:
function [rewardA, rewardB,isDone,LoggedSignals,observation] = stepfunction(LoggedSignals,actionA,actionB,observation)
coder.extrinsic('myfun')
[rewardA, rewardB,isDone,LoggedSignals,observation] = myfun(LoggedSignals,actionA,actionB,observation);
end
Then put anything that handles Vissim in myfun, and MATLAB will not try to convert it to C code, which may eliminate the errors you are seeing.
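That companion 'myfun' might look like the sketch below, reusing the persistent-variable idea from earlier in the thread (the body is a placeholder; only the COM calls are copied from your snippets):

```matlab
% Sketch of 'myfun' as a plain .m file on the MATLAB path. Because it is
% called extrinsically, it runs in the interpreter and can hold a COM object.
function [rewardA, rewardB, isDone, LoggedSignals, observation] = ...
        myfun(LoggedSignals, actionA, actionB, observation)
persistent Vissim
if isempty(Vissim)
    % Created once on the first step and reused afterwards
    Vissim = actxserver('Vissim.Vissim.700');
    Vissim.LoadNet('D:\User\Vissim\testnet\testnetdiscrete.inpx');
end
% ... apply actionA/actionB via the COM interface, run the Vissim steps,
% read back the observations, and compute the rewards here ...
rewardA = 0; rewardB = 0; isDone = 0;  % placeholders for the elided logic
end
```

One more detail that likely explains the "mxArray" errors: inside a MATLAB Function block, the outputs of an extrinsic call come back as mxArray unless you preassign them to values of a known type and size (e.g. rewardA = 0; observation = zeros(11,1); before calling myfun), so the generated code knows what to expect.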
Hopefully that works

