%% bare_jrnl.tex
%% V1.4b
%% 2015/08/26
%% by Michael Shell
%% see http://www.michaelshell.org/
%% for current contact information.
%%
%% This is a skeleton file demonstrating the use of IEEEtran.cls
%% (requires IEEEtran.cls version 1.8b or later) with an IEEE
%% journal paper.
%%
%% Support sites:
%% http://www.michaelshell.org/tex/ieeetran/
%% http://www.ctan.org/pkg/ieeetran
%% and
%% http://www.ieee.org/

%%*************************************************************************
%% Legal Notice:
%% This code is offered as-is without any warranty either expressed or
%% implied; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE!
%% User assumes all risk.
%% In no event shall the IEEE or any contributor to this code be liable for
%% any damages or losses, including, but not limited to, incidental,
%% consequential, or any other damages, resulting from the use or misuse
%% of any information contained here.
%%
%% All comments are the opinions of their respective authors and are not
%% necessarily endorsed by the IEEE.
%%
%% This work is distributed under the LaTeX Project Public License (LPPL)
%% ( http://www.latex-project.org/ ) version 1.3, and may be freely used,
%% distributed and modified. A copy of the LPPL, version 1.3, is included
%% in the base LaTeX documentation of all distributions of LaTeX released
%% 2003/12/01 or later.
%% Retain all contribution notices and credits.
%% ** Modified files should be clearly indicated as such, including  **
%% ** renaming them and changing author support contact information. **
%%*************************************************************************


% *** Authors should verify (and, if needed, correct) their LaTeX system  ***
% *** with the testflow diagnostic prior to trusting their LaTeX platform ***
% *** with production work. The IEEE's font choices and paper sizes can   ***
% *** trigger bugs that do not appear when using other class files.       ***                          ***
% The testflow support page is at:
% http://www.michaelshell.org/tex/testflow/


\documentclass[journal]{IEEEtran}
%
% If IEEEtran.cls has not been installed into the LaTeX system files,
% manually specify the path to it like:
% \documentclass[journal]{../sty/IEEEtran}


% Some very useful LaTeX packages include:
% (uncomment the ones you want to load)


% *** MISC UTILITY PACKAGES ***
%
%\usepackage{ifpdf}
% Heiko Oberdiek's ifpdf.sty is very useful if you need conditional
% compilation based on whether the output is pdf or dvi.
% usage:
% \ifpdf
%   % pdf code
% \else
%   % dvi code
% \fi
% The latest version of ifpdf.sty can be obtained from:
% http://www.ctan.org/pkg/ifpdf
% Also, note that IEEEtran.cls V1.7 and later provides a builtin
% \ifCLASSINFOpdf conditional that works the same way.
% When switching from latex to pdflatex and vice-versa, the compiler may
% have to be run twice to clear warning/error messages.


% *** CITATION PACKAGES ***
%
\usepackage{cite}
% cite.sty was written by Donald Arseneau
% V1.6 and later of IEEEtran pre-defines the format of the cite.sty package
% \cite{} output to follow that of the IEEE. Loading the cite package will
% result in citation numbers being automatically sorted and properly
% "compressed/ranged". e.g., [1], [9], [2], [7], [5], [6] without using
% cite.sty will become [1], [2], [5]--[7], [9] using cite.sty. cite.sty's
% \cite will automatically add leading space, if needed. Use cite.sty's
% noadjust option (cite.sty V3.8 and later) if you want to turn this off
% such as if a citation ever needs to be enclosed in parenthesis.
% cite.sty is already installed on most LaTeX systems. Be sure and use
% version 5.0 (2009-03-20) and later if using hyperref.sty.
% The latest version can be obtained at:
% http://www.ctan.org/pkg/cite
% The documentation is contained in the cite.sty file itself.


% *** GRAPHICS RELATED PACKAGES ***
%
\ifCLASSINFOpdf
  \usepackage[pdftex]{graphicx}
  %declare the path(s) where your graphic files are
  \graphicspath{{../pdf/}{../jpeg/}}
  %and their extensions so you won't have to specify these with
  %every instance of \includegraphics
  \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
\else
  % or other class option (dvipsone, dvipdf, if not using dvips). graphicx
  % will default to the driver specified in the system graphics.cfg if no
  % driver is specified.
  % \usepackage[dvips]{graphicx}
  % declare the path(s) where your graphic files are
  % \graphicspath{{../eps/}}
  % and their extensions so you won't have to specify these with
  % every instance of \includegraphics
  % \DeclareGraphicsExtensions{.eps}
\fi
% graphicx was written by David Carlisle and Sebastian Rahtz. It is
% required if you want graphics, photos, etc. graphicx.sty is already
% installed on most LaTeX systems. The latest version and documentation
% can be obtained at:
% http://www.ctan.org/pkg/graphicx
% Another good source of documentation is "Using Imported Graphics in
% LaTeX2e" by Keith Reckdahl which can be found at:
% http://www.ctan.org/pkg/epslatex
%
% latex, and pdflatex in dvi mode, support graphics in encapsulated
% postscript (.eps) format. pdflatex in pdf mode supports graphics
% in .pdf, .jpeg, .png and .mps (metapost) formats. Users should ensure
% that all non-photo figures use a vector format (.eps, .pdf, .mps) and
% not a bitmapped formats (.jpeg, .png). The IEEE frowns on bitmapped formats
% which can result in "jaggedy"/blurry rendering of lines and letters as
% well as large increases in file sizes.
%
% You can find documentation about the pdfTeX application at:
% http://www.tug.org/applications/pdftex


% *** MATH PACKAGES ***
%
\usepackage{amsmath}
\usepackage{newtxtext,newtxmath}
% A popular package from the American Mathematical Society that provides
% many useful and powerful commands for dealing with mathematics.
%
% Note that the amsmath package sets \interdisplaylinepenalty to 10000
% thus preventing page breaks from occurring within multiline equations. Use:
%\interdisplaylinepenalty=2500
% after loading amsmath to restore such page breaks as IEEEtran.cls normally
% does. amsmath.sty is already installed on most LaTeX systems. The latest
% version and documentation can be obtained at:
% http://www.ctan.org/pkg/amsmath


% *** SPECIALIZED LIST PACKAGES ***
%
%\usepackage{algorithmic}
% algorithmic.sty was written by Peter Williams and Rogerio Brito.
% This package provides an algorithmic environment fo describing algorithms.
% You can use the algorithmic environment in-text or within a figure
% environment to provide for a floating algorithm. Do NOT use the algorithm
% floating environment provided by algorithm.sty (by the same authors) or
% algorithm2e.sty (by Christophe Fiorio) as the IEEE does not use dedicated
% algorithm float types and packages that provide these will not provide
% correct IEEE style captions. The latest version and documentation of
% algorithmic.sty can be obtained at:
% http://www.ctan.org/pkg/algorithms
% Also of interest may be the (relatively newer and more customizable)
% algorithmicx.sty package by Szasz Janos:
% http://www.ctan.org/pkg/algorithmicx


% *** ALIGNMENT PACKAGES ***
%
%\usepackage{array}
% Frank Mittelbach's and David Carlisle's array.sty patches and improves
% the standard LaTeX2e array and tabular environments to provide better
% appearance and additional user controls. As the default LaTeX2e table
% generation code is lacking to the point of almost being broken with
% respect to the quality of the end results, all users are strongly
% advised to use an enhanced (at the very least that provided by array.sty)
% set of table tools. array.sty is already installed on most systems. The
% latest version and documentation can be obtained at:
% http://www.ctan.org/pkg/array


% IEEEtran contains the IEEEeqnarray family of commands that can be used to
% generate multiline equations as well as matrices, tables, etc., of high
% quality.


% *** SUBFIGURE PACKAGES ***
%\ifCLASSOPTIONcompsoc
%  \usepackage[caption=false,font=normalsize,labelfont=sf,textfont=sf]{subfig}
%\else
%  \usepackage[caption=false,font=footnotesize]{subfig}
%\fi
% subfig.sty, written by Steven Douglas Cochran, is the modern replacement
% for subfigure.sty, the latter of which is no longer maintained and is
% incompatible with some LaTeX packages including fixltx2e. However,
% subfig.sty requires and automatically loads Axel Sommerfeldt's caption.sty
% which will override IEEEtran.cls' handling of captions and this will result
% in non-IEEE style figure/table captions. To prevent this problem, be sure
% and invoke subfig.sty's "caption=false" package option (available since
% subfig.sty version 1.3, 2005/06/28) as this is will preserve IEEEtran.cls
% handling of captions.
% Note that the Computer Society format requires a larger sans serif font
% than the serif footnote size font used in traditional IEEE formatting
% and thus the need to invoke different subfig.sty package options depending
% on whether compsoc mode has been enabled.
%
% The latest version and documentation of subfig.sty can be obtained at:
% http://www.ctan.org/pkg/subfig


% *** FLOAT PACKAGES ***
%
%\usepackage{fixltx2e}
% fixltx2e, the successor to the earlier fix2col.sty, was written by
% Frank Mittelbach and David Carlisle. This package corrects a few problems
% in the LaTeX2e kernel, the most notable of which is that in current
% LaTeX2e releases, the ordering of single and double column floats is not
% guaranteed to be preserved. Thus, an unpatched LaTeX2e can allow a
% single column figure to be placed prior to an earlier double column
% figure.
% Be aware that LaTeX2e kernels dated 2015 and later have fixltx2e.sty's
% corrections already built into the system in which case a warning will
% be issued if an attempt is made to load fixltx2e.sty as it is no longer
% needed.
% The latest version and documentation can be found at:
% http://www.ctan.org/pkg/fixltx2e


%\usepackage{stfloats}
% stfloats.sty was written by Sigitas Tolusis. This package gives LaTeX2e
% the ability to do double column floats at the bottom of the page as well
% as the top. (e.g., "\begin{figure*}[!b]" is not normally possible in
% LaTeX2e). It also provides a command:
%\fnbelowfloat
% to enable the placement of footnotes below bottom floats (the standard
% LaTeX2e kernel puts them above bottom floats). This is an invasive package
% which rewrites many portions of the LaTeX2e float routines. It may not work
% with other packages that modify the LaTeX2e float routines. The latest
% version and documentation can be obtained at:
% http://www.ctan.org/pkg/stfloats
% Do not use the stfloats baselinefloat ability as the IEEE does not allow
% \baselineskip to stretch. Authors submitting work to the IEEE should note
% that the IEEE rarely uses double column equations and that authors should try
% to avoid such use. Do not be tempted to use the cuted.sty or midfloat.sty
% packages (also by Sigitas Tolusis) as the IEEE does not format its papers in
% such ways.
% Do not attempt to use stfloats with fixltx2e as they are incompatible.
% Instead, use Morten Hogholm'a dblfloatfix which combines the features
% of both fixltx2e and stfloats:
%
% \usepackage{dblfloatfix}
% The latest version can be found at:
% http://www.ctan.org/pkg/dblfloatfix


%\ifCLASSOPTIONcaptionsoff
%  \usepackage[nomarkers]{endfloat}
% \let\MYoriglatexcaption\caption
% \renewcommand{\caption}[2][\relax]{\MYoriglatexcaption[#2]{#2}}
%\fi
% endfloat.sty was written by James Darrell McCauley, Jeff Goldberg and
% Axel Sommerfeldt. This package may be useful when used in conjunction with
% IEEEtran.cls'  captionsoff option. Some IEEE journals/societies require that
% submissions have lists of figures/tables at the end of the paper and that
% figures/tables without any captions are placed on a page by themselves at
% the end of the document. If needed, the draftcls IEEEtran class option or
% \CLASSINPUTbaselinestretch interface can be used to increase the line
% spacing as well. Be sure and use the nomarkers option of endfloat to
% prevent endfloat from "marking" where the figures would have been placed
% in the text. The two hack lines of code above are a slight modification of
% that suggested by in the endfloat docs (section 8.4.1) to ensure that
% the full captions always appear in the list of figures/tables - even if
% the user used the short optional argument of \caption[]{}.
% IEEE papers do not typically make use of \caption[]'s optional argument,
% so this should not be an issue. A similar trick can be used to disable
% captions of packages such as subfig.sty that lack options to turn off
% the subcaptions:
% For subfig.sty:
% \let\MYorigsubfloat\subfloat
% \renewcommand{\subfloat}[2][\relax]{\MYorigsubfloat[]{#2}}
% However, the above trick will not work if both optional arguments of
% the \subfloat command are used. Furthermore, there needs to be a
% description of each subfigure *somewhere* and endfloat does not add
% subfigure captions to its list of figures. Thus, the best approach is to
% avoid the use of subfigure captions (many IEEE journals avoid them anyway)
% and instead reference/explain all the subfigures within the main caption.
% The latest version of endfloat.sty and its documentation can obtained at:
% http://www.ctan.org/pkg/endfloat
%
% The IEEEtran \ifCLASSOPTIONcaptionsoff conditional can also be used
% later in the document, say, to conditionally put the References on a
% page by themselves.


% *** PDF, URL AND HYPERLINK PACKAGES ***
%
\usepackage{url}
% url.sty was written by Donald Arseneau. It provides better support for
% handling and breaking URLs. url.sty is already installed on most LaTeX
% systems. The latest version and documentation can be obtained at:
% http://www.ctan.org/pkg/url
% Basically, \url{my_url_here}.


% *** Do not adjust lengths that control margins, column widths, etc. ***
% *** Do not use packages that alter fonts (such as pslatex).         ***
% There should be no need to do such things with IEEEtran.cls V1.6 and later.
% (Unless specifically asked to do so by the journal or conference you plan
% to submit to, of course. )


% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}


\begin{document}
%
% paper title
% Titles are generally capitalized except for words such as a, an, and, as,
% at, but, by, for, in, nor, of, on, or, the, to and up, which are usually
% not capitalized unless they are the first or last word of the title.
% Linebreaks \\ can be used within to get better formatting as desired.
% Do not put math or special symbols in the title.
\title{Optimal Neural Network Feature Selection for Forecasting of Spatial-Temporal Series}
%
%
% author names and IEEE memberships
% note positions of commas and nonbreaking spaces ( ~ ) LaTeX will not break
% a structure at a ~ so this keeps an author's name from being broken across
% two lines.
% use \thanks{} to gain access to the first footnote area
% a separate \thanks must be used for each paragraph as LaTeX2e's \thanks
% was not built to handle multiple paragraphs
%


% \author{Michael~Shell,~\IEEEmembership{Member,~IEEE,}
        % John~Doe,~\IEEEmembership{Fellow,~OSA,}
        % and~Jane~Doe,~\IEEEmembership{Life~Fellow,~IEEE}% <-this % stops a space
% \thanks{M. Shell was with the Department
% of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta,
% GA, 30332 USA e-mail: (see http://www.michaelshell.org/contact.html).}% <-this % stops a space
% \thanks{J. Doe and J. Doe are with Anonymous University.}% <-this % stops a space
% \thanks{Manuscript received April 19, 2005; revised August 26, 2015.}}

\author{Eurico~Covas,~\IEEEmembership{}
	    Emmanouil~Benetos~\IEEEmembership{}% <-this % stops a space
\thanks{E. Covas is with the CITEUC, Geophysical and Astronomical Observatory, University of Coimbra, 3040-004, Coimbra, Portugal, and
Queen Mary University London, 10 Godward Square, Mile End Rd, London E1 4FZ,  e-mail: (eurico.covas@mail.com)}% <-this % stops a space
\thanks{E. Benetos is with Queen Mary University London, 10 Godward Square, Mile End Rd, London E1 4FZ,
e-mail: (emmanouil.benetos@qmul.ac.uk)}% <-this % stops a space}\thanks{Manuscript received December 31, 2017; revised December 31, 2017.}
\thanks{Manuscript received x, 2018; revised xx, 2018.}}

% note the % following the last \IEEEmembership and also \thanks -
% these prevent an unwanted space from occurring between the last author name
% and the end of the author line. i.e., if you had this:
%
% \author{....lastname \thanks{...} \thanks{...} }
%                     ^------------^------------^----Do not want these spaces!
%
% a space would be appended to the last name and could cause every name on that
% line to be shifted left slightly. This is one of those "LaTeX things". For
% instance, "\textbf{A} \textbf{B}" will typeset as "A B" not "AB". To get
% "AB" then you have to do: "\textbf{A}\textbf{B}"
% \thanks is no different in this regard, so shield the last } of each \thanks
% that ends a line with a % and do not let a space in before the next \thanks.
% Spaces after \IEEEmembership other than the last one are OK (and needed) as
% you are supposed to have spaces between the names. For what it is worth,
% this is a minor point as most people would not even notice if the said evil
% space somehow managed to creep in.


% The paper headers
\markboth{}%
{Covas \MakeLowercase{\textit{et al.}}: Optimal Neural Network Feature Selection for Forecasting of Spatial-Temporal Series}
% The only time the second header will appear is for the odd numbered pages
% after the title page when using the twoside option.
%
% *** Note that you probably will NOT want to include the author's ***
% *** name in the headers of peer review papers.                   ***
% You can use \ifCLASSOPTIONpeerreview for conditional compilation here if
% you desire.


% If you want to put a publisher's ID mark on the page you can do it like
% this:
%\IEEEpubid{0000--0000/00\$00.00~\copyright~2015 IEEE}
% Remember, if you use this you must call \IEEEpubidadjcol in the second
% column for its text to clear the IEEEpubid mark.


% use for special paper notices
%\IEEEspecialpapernotice{(Invited Paper)}


% make the title area
\maketitle

% As a general rule, do not put math, special symbols or citations
% in the abstract or keywords.
\begin{abstract}
We present empirical evidence on how to construct the optimal feature selection, or architecture, for the input layer
of a neural network for the purpose of forecasting spatial-temporal signals.
The approach is based on results from dynamical systems theory, namely the
non-linear embedding theorems. We demonstrate it for a variety of
two-dimensional signals, with one spatial dimension and one time dimension, and show
that the optimal input layer seems to consist
of a two-dimensional grid, with the spatial/temporal lags determined by the minimum of the mutual information of the spatial/temporal signal
subsets
and the number of points taken in space/time determined by the embedding dimension of the signal. We present evidence for this conjecture
by running a Monte
Carlo simulation of several combinations of input layer architectures and showing that the one predicted by the non-linear embedding
theorems seems to be optimal or close to optimal. In total we show evidence for four unrelated systems: a series of coupled
H\'{e}non maps; a series of coupled ordinary differential equations (Lorenz-96) phenomenologically modelling atmospheric dynamics;
the Kuramoto-Sivashinsky equation; and finally real data on sunspot areas on the Sun (in latitude and time) from 1870 to the present.
\end{abstract}

% Note that keywords are not normally used for peerreview papers.
\begin{IEEEkeywords}
IEEE, IEEEtran, journal, \LaTeX, paper.
\end{IEEEkeywords}


% For peer review papers, you can put extra information on the cover
% page as needed:
% \ifCLASSOPTIONpeerreview
% \begin{center} \bfseries EDICS Category: 3-BBND \end{center}
% \fi
%
% For peerreview papers, this IEEEtran command inserts a page break and
% creates the second title. It will be ignored for other modes.
\IEEEpeerreviewmaketitle


\section{Introduction}
% The very first letter is a 2 line initial drop letter followed
% by the rest of the first word in caps.
%
% form to use if the first word consists of a single letter:
% \IEEEPARstart{A}{demo} file is ....
%
% form to use if you need the single drop letter followed by
% normal text (unknown if ever used by the IEEE):
% \IEEEPARstart{A}{}demo file is ....
%
% Some journals put the first two words in caps:
% \IEEEPARstart{T}{his demo} file is ....
%
% Here we have the typical use of a "T" for an initial drop letter
% and "HIS" in caps to complete the first word.
\IEEEPARstart{G}{iven} a physical data set, one of the most important questions one can pose is: ``Can we predict the future?''
This question can be posed irrespective of whether we already have some insight into, or are even
certain of, the exact model behind some or all of the observed variables. For example, for chaotic dynamical systems
\cite{0813340853,2004icti.book.....M}, we may even know the
underlying dynamics but still find it hard to predict the future, given that chaotic systems have exponential sensitivity to initial
conditions. The more chaotic a system is (as measured by the magnitude of its largest positive Lyapunov exponent
\cite{1985PhyD...16..285W,1994PhLA..185...77K}), the harder it gets to
predict the future, even within very short time horizons. In the limiting case of a random system, it is not possible to predict the
future, although one can make statements about certain future statistics~\cite{9780486693873}.
For the case of weakly chaotic systems, there is an extensive literature on
forecasting methods, ranging from linear approximations~\cite{1451722},
truncated functional expansion series~\cite{Powell:1987:RBF:48424.48433,Broomhead1988MultivariableFI}, non-linear embeddings
\cite{PhysRevLett.59.845},
auto-regression methods~\cite{0130607746} and hidden Markov models~\cite{1165342}
to state-of-the-art neural networks and deep learning methodologies~\cite{LANGKVIST201411}.
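
To make the link between the largest Lyapunov exponent and predictability explicit, a standard heuristic (the symbols $\delta_0$, $\Delta$, $\lambda_{\max}$ and $T$ are introduced here for illustration only) is that a small initial error $\delta_0$ grows on average as
\begin{equation}
\|\delta(t)\| \approx \|\delta_0\|\, e^{\lambda_{\max} t},
\end{equation}
so that a forecast stays within a tolerance $\Delta$ only up to a horizon of roughly
\begin{equation}
T \approx \frac{1}{\lambda_{\max}} \ln \frac{\Delta}{\|\delta_0\|},
\end{equation}
which shrinks as $\lambda_{\max}$ increases.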

Most of the literature on forecasting chaotic signals is dedicated to a single time series, or treats a collection of related time series as a
non-extended set, i.e. a multivariate set of discrete, as opposed to continuous, time series.
For forecasting spatial-temporal chaos we refer the reader to
\cite{doi:10.1063/1.165894,PhysRevE.51.R2709,PhysRevLett.85.2300,Parlitz2000NonlinearPO,2000PhRvL..84.1890P,Xia2006APF,covas2016}
and references therein.
Even rarer are attempts to forecast spatial-temporal chaos using neural networks and deep learning methodologies
\cite{covaspeixinhojoao,2017arXiv170805094M,2017arXiv171100636M,2017arXiv171110566R,2017arXiv171110561R,2017arXiv171009668L,
2018JCoPh.357..125R,2018arXiv180106637R}, although this field of research is clearly growing at the moment.
Nonetheless, this area of research is of importance, as most physical systems are spatially extended, e.g.
the atmospheric system driving the weather \cite{9780521857291}; the solar dynamo driving the Sun's sunspots \cite{9780198512905};
and the influence of sunspots on the Earth's magnetic field via the solar wind, coronal mass ejections and
solar flares -- the so-called space weather \cite{1851RSPT..141..123S, 1852RSPT..142..103S,
1979P&SS...27.1001S,1983SoPh...89..195E,1965P&SS...13....9P,2000AdSpR..26...27W,2003A&AT...22..861B,2005GeoRL..3221106S,2005SpWea...3.8C01K,2006GMS...165..367T,2006GeoRL..3318101H,
2009SunGe...4...55C,2011SpWea...9.6001C,2013EGUGA..1510865W,2015SpWea..13..524S}.

The reasons why spatial-temporal chaos
is so difficult to forecast are twofold: first, the size of the attractor, which is usually quite large; and second, the difficulty of choosing
the variables to use for forecasting. That is, is there enough information at the same point back in time to derive the future of that
particular point, or do spatial correlations and spatial propagation affect it in such a way that one must take into account some
set of spatial and temporal neighbours to forecast the future? And if so, can that set be defined, and how can it be constructed?
It is this last question that we investigate in this paper, in the particular context of spatial-temporal forecasting using neural networks. Feature extraction and the design of the architecture of the input layer of a neural network are an art form, relying mostly
on trial and error and domain knowledge. For forecasting of time series, a simple approach consists
of designing the input layer as a vector of previous data using a time delay, the
time delay neural network method \cite{Waibel:1990:PRU:108235.108263, luk2000study, Frank2001, OH2002249, 1009-1963-12-6-304,
inputlayer}. For spatial-temporal series, one can generalize this to include both temporal and spatial delays \cite{covas2016,covaspeixinhojoao}.
This is where the connection to dynamical systems can be useful.
In 1981, Takens established, in what is now known as the Takens embedding theorem \cite{1981LNM...898..366T}, the theoretical background
for a mathematical method to reconstruct the dynamics of the underlying attractor of a chaotic dynamical system
from a time-ordered sequence of data observations. Note that the reconstruction conserves the properties of the
original dynamical system up to a diffeomorphism.
Further developments established a
series of embedding theorems by \cite{key1503303m}, \cite{1981LNM...898..366T, 1981LNM...898..230M} and \cite{1991JSP....65..579S}.
These theorems serve as the basis for non-linear embedding and forecasting based on the original variables.
Related articles propose choosing the time delay at the first minimum of the {\em mutual information} (see
\cite{Fraser86, abarbanel1997analysis, opac-b1092652}) and choosing the number of points using
the {\em method of false nearest neighbours} suggested by
\cite{1992PhRvA..45.3403K} and reported in detail in
\cite{1992PhRvA..45.7058M, 1993RvMP...65.1331A, 1996PhT....49k..86A, abarbanel1997analysis}.
Using this theoretical framework, we propose that this non-linear embedding method can be used to indicate
the best way to construct the input layer of a neural network.
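
As an illustrative sketch of these two ingredients for a scalar time series $s(t)$ (the symbols $\tau$, $d$, $p_i$ and $p_{ij}$ are notation introduced here, and the histogram estimate of the probabilities is an assumption rather than a prescription), the delay vector may be written as
\begin{equation}
\mathbf{x}(t) = \bigl( s(t),\, s(t-\tau),\, \ldots,\, s(t-(d-1)\tau) \bigr),
\end{equation}
where $d$ is the embedding dimension estimated by the method of false nearest neighbours, and the delay $\tau$ is taken at the first minimum of the average mutual information
\begin{equation}
I(\tau) = \sum_{i,j} p_{ij}(\tau) \ln \frac{p_{ij}(\tau)}{p_i\, p_j},
\end{equation}
with $p_i$ the probability that $s(t)$ falls in the $i$-th histogram bin and $p_{ij}(\tau)$ the joint probability that $s(t)$ falls in bin $i$ while $s(t+\tau)$ falls in bin $j$.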

In this paper we show empirical evidence for an optimal architecture of the input layer of a neural network which tries
to forecast or predict a spatial-temporal signal. Here we show the empirical evidence for two particular cases of
two-dimensional data series $s^n_m$ (one spatial,
one temporal dimension). By a two-dimensional data series we mean a scalar field which can be
defined by an $N\times M$ matrix with components $s^n_m \in \mathbb{R}$.
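
For concreteness, one possible way to write the input grid implied by this construction is sketched below. It assumes that the superscript $n$ is the time index and the subscript $m$ the spatial index, that spatial neighbours are taken symmetrically and that only past temporal neighbours are used; the symbols $I$, $J$, $K$ and $L$ are introduced here for illustration only:
\begin{equation}
\mathbf{x}(s^n_m) = \bigl\{\, s^{\,n-jL}_{\,m+iK} \;:\; i=-I,\ldots,I;\;\; j=0,\ldots,J \,\bigr\},
\end{equation}
where $K$ and $L$ are the spatial and temporal lags, and $2I+1$ and $J+1$ are the numbers of points taken in space and time, so that the input layer has $(2I+1)(J+1)$ nodes. Under these assumptions the network would be trained to map $\mathbf{x}(s^n_m)$ to the next value $s^{n+1}_m$.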

\subsection{Neural networks for time-series forecasting}

\cite{McCulloch1943}

Describe the use of feedforward artificial neural networks, then recurrent networks and other methods (put all references).
Put all references to pure time only forecasts.

\subsection{Neural networks for spatial-temporal forecasting}

There have been several attempts to forecast spatial-temporal data series in the literature (add references \cite{2017arXiv170805094M,
2017arXiv171100636M, ghaderi2017deepforecast}).
The references above
are confined to articles that attempt to forecast the actual full scalar field $s^n_m$, as opposed to the ongoing research on
pattern recognition in moving images (2D and 3D), which attempts to pick out particular features in images (e.g. car, pedestrian, bicycle,
person) and to forecast where those features will be in subsequent images within the particular moving sequence (ADD REFERENCES),
or research
on word sequences (ADD REFERENCES).

\subsection{Sunspot and solar data series forecasting using neural networks}

Neural networks, among other forecasting methods (add references), have been used extensively to try to predict sunspots and related
data series. Mostly this has been done in time only (add references). There are a few examples (add references) of actual spatial-temporal
forecasts using neural networks (see mine and the one on MDI/HDI data).

\subsection{Input layer architecture for neural networks for spatial-temporal forecasting}

All of the references above on neural network forecasting of spatial-temporal data series either use a simple delay based architecture
for the input layer, or use a time delay based on the first minimum of the {\em mutual information} (as proposed in
\cite{Fraser86, abarbanel1997analysis, opac-b1092652}) and/or use the number of points dictated by
the embedding theorems by \cite{key1503303m}, \cite{1981LNM...898..366T, 1981LNM...898..230M} and \cite{1991JSP....65..579S},
estimated using the {\em method of false nearest neighbours} suggested by
\cite{1992PhRvA..45.3403K} and reported in detail in
\cite{1992PhRvA..45.7058M, 1993RvMP...65.1331A, 1996PhT....49k..86A, abarbanel1997analysis}. There are other researchers
who use more complex neural networks without being explicit about time or spatial delays \cite{2017arXiv171205293C}. ADD REFERENCES.

However, as far as we are aware, none of the references that use the embedding theorems and the related mutual information
and false nearest neighbours methods justify this choice, i.e., the approach is explained but not proven
either theoretically or empirically. Here we use four examples of spatial-temporal signals, one a real world data series and three
synthetic signals, to demonstrate that there seems to be empirical evidence for an underlying theorem for what the optimal
neural network feature selection for forecasting of spatial-temporal series should be.


REWRITE below and put into above subsections
\\


Several authors \cite{
1990EOSTr..71..677K,
1991PhDT.......158W,
Weigend92HubermanRumelhart,
1993AdSpR..13..447M,
1995JGR...10021735M,
1994VA.....38..351C,
0305-4470-28-12-012,
1995ApJ...444..916C,
Koskela96timeseries,
1996SoPh..168..423F,
1996AnGeo..14...20F,
1996ITNN....7..501P,
1998GeoRL..25..457K,
1998JGR...10329733C,
1998NewAR..42..343C,
Verdes2000,
2004SoPh..221..167V,
2001GMS...125..201L,
2002PhRvE..66f6701S,
2004SoPh..224..247M,
2005JASTP..67..595G,
2004SPIE.5497..542A ,
2005MmSAI..76.1030Q,
2005SoPh..227..177A,
2006JASTP..68.2061M,
2007SoPh..243..253Q ,
2013Ap&SS.344....5A,
xie2006hybrid,
2006SunGe...1a...8M,
2007IJMPC..18.1839E,
1997SPIE.3077..116P,
2006AGUFMSH21A0315L,
2008cosp...37.3467W,
gang2007sunspot,
2009JASTP..71..569U,
2010BAAA...53..241F,
1999BAAA...43...23P,
2011RAA....11..491A,
2010cosp...38.2153A,
2011CRGeo.343..433C,
JiangS11,
2012cosp...39.1194M,
Chandra:2012:CCE:2181341.2181747,
park2009prediction ,
kim2010sunspot,
moghaddam2013sunspot,
2012EPJP..127...43C,
liu2012sunspot,
Gkana201579,
DBLP:conf/ijcnn/ParsapoorBS15,
DBLP:conf/aaai/ParsapoorBS15,
raios
} have already
attempted to use neural networks to forecast aspects of the sunspot cycle, although none in both space and time: they have restricted themselves
mostly to forecasting either the sunspot number or the sunspot areas as a function of time.

Takens established the theoretical background \cite{1981LNM...898..366T} in the Takens embedding theorem. Further developments established a
series of  theorems by \cite{key1503303m}, \cite{1981LNM...898..366T, 1981LNM...898..230M} and \cite{1991JSP....65..579S}.

Some authors discuss the use of mutual information and/or the embedding dimension as a constraint on the input layer, i.e.\ on the feature selection.
\cite{annunziato, Gkana201579, Zachilas2015, Sun2010109, HUANG20108590, 298224, Frank2001, 1998GeoRL..25..457K, BUHAMRA2003805, chandra2012cooperative, DBLP:journals/corr/MaslennikovaB14, sauter2010spatio, JiangS11, inputlayer, 1997IJMPC...8.1345K,
1009-1963-12-6-304, Verdes2000, 1996SoPh..168..423F, 2007AdG....10...67L, Chandra:2012:CCE:2181341.2181747, raios, maaß2003mathematical} used the
mutual information and/or the false nearest neighbours methods to estimate suitable embedding parameters.

\cite{0305-4470-28-12-012, 1995ApJ...444..916C, 1998GeoRL..25..457K, Zachilas2015, Chandra:2012:CCE:2181341.2181747, Gkana201579, raios} attempted to forecast the solar sunspot number using neural networks and they used the
embedding dimension of the sunspot time series as a way to define the architecture of the input layer.

\cite{Simon:2007:HDS:1230147.1230294} generalize the mutual information approach to higher dimensions but do not connect it to the problem of the
neural network input layer architecture optimization.

There are also papers \cite{articleRagulskis} that try to use neural networks to determine the optimal embedding and time delay for the purposes of local
reconstruction of states with a view to forecasting.

There are also papers \cite{Xia2006APF} that use Support Vector Machines (SVMs) to forecast in space and time and use time delays and embedding approaches to define
the state vectors.

In fact Parlitz and Merkwirth \cite{Parlitz2000NonlinearPO}, in their
ESANN 2000 meeting proceedings paper, mentioned that local reconstruction of states ``\ldots may also serve as
a starting point for deriving local mathematical models in terms of polynomials,
radial basis functions or neural networks.'' Here we attempt to show empirical evidence that this is not just
a starting point, but the optimal neural network input architecture.

\hfill Eurico Covas

\hfill December 31, 2017

\section{Neural Network architecture}

The neural network architecture we choose to demonstrate our conjecture is a form of the basic feedforward neural network,
sometimes called the time-delayed neural network \cite{Waibel:1990:PRU:108235.108263}, trained using the so-called back-propagation
algorithm \cite{10.1007/BFb0006203, 1986Natur.323..533R, 58337}. We focus on spatial-temporal series, so we have extended the usual time-delayed neural network
to be a time and space delayed network. The overall architecture of the network is depicted in detail in Fig.\ \ref{architecture}.

\begin{figure}
\resizebox{\hsize}{!}{\includegraphics{architecture.pdf}}
\caption{Forecasting method illustration. The neural network is made of an input layer, one or more hidden layer(s) and one output layer.
In this article, we use only one hidden layer and the output layer is made of a single neuron. Each input pattern $x(i)$ is sent to the
input layer, then the value of each hidden neuron is calculated from the weighted sum of the inputs $\sum w(i,j) x(i)$
and passed through the activation function. The output is then the weighted sum of the hidden node values,
$\sum w'(i,j) y(i)$, with the second set of weights, again passed through another (or the same) activation function.
Each input pattern $x(i)$ is actually a matrix constructed using an embedding space of
spatial and temporal delays, calculated from the actual physical spatial-temporal data values $s(n,m)$. After many randomly chosen input patterns
are passed through the neural network, the weights hopefully converge to an optimal training value.}
\label{architecture}
\end{figure}

Under this architecture, we use the ideas proposed in \cite{Parlitz2000NonlinearPO} to construct a grid of input values
which are then fed to the neural network to produce a single output, the future state. Formally,
let $n=1,\ldots,N$ and $m=1,\ldots,M$.
Consider a spatial-temporal data series ${\bf s}$ which can be
defined by an $N\times M$ matrix with components
$s^n_m \in \mathbb{R}$. We call these components the {\em states} of the spatial-temporal series.
Consider a number $2I\in \mathbb{N}$ of neighbours in space of a given
$s^n_m$ and a number $J\in \mathbb{N}$ of temporal past neighbours relative to $s^n_m$ (see Fig.\ \ref{NeuralNetwork} for details).
For each $s^n_m$, we define the input (feature) vector ${\bf x} (s^n_m)$ with components given by
$s^n_m$ and its $J$ past temporal neighbours, each together with its $2I$ spatial
neighbours, with $K$ and $L$ being the spatial and temporal lags:
\begin{multline}
{\bf x}(s^n_m)=\{s^n_{m-I K }, \ldots,s^n_m, \ldots, s^n_{m+I K},
s^{n-L}_{m-I K},\ldots, s^{n-L}_{m},\ldots, s^{n-L}_{m+I K},
\ldots,\\
s^{n-J L}_{m-I K},\ldots, s^{n-J L}_{m},\ldots, s^{n-J L}_{m+I K} \}
\label{embedding}
\end{multline}

So, the input is a
$(2 I+1)(J+1)$-dimensional vector ${\bf x}(s^n_m)$ and the target (output) used to train the network is the value $s^{n+1}_{m}$.
In Fig.\ \ref{NeuralNetwork} we show the details of this method. We train the network using stochastic gradient
back-propagation (with momentum, and weight decay for regularization) by running a stochastic batch
in which we randomly sample pairs of inputs and outputs from the training set: ${\bf x}(s^n_m)$ and $s^{n+1}_{m}$, respectively. Then at test time
we choose inputs ${\bf x}(s^n_m)$ such that $n=N_{\textnormal{train}}$, $N_{\textnormal{train}}$ being the number of temporal slices in the training set.
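
As an illustration of Eq.\ (\ref{embedding}), consider the values $I=1$, $J=3$, $K=2$ and $L=3$ (picked here purely as an example; they happen to coincide with the optimal values reported later for the coupled H\'{e}non maps). The input vector then has $(2I+1)(J+1)=3\times 4=12$ components,
\begin{multline*}
{\bf x}(s^n_m)=\{s^n_{m-2}, s^n_m, s^n_{m+2},
s^{n-3}_{m-2}, s^{n-3}_{m}, s^{n-3}_{m+2},\\
s^{n-6}_{m-2}, s^{n-6}_{m}, s^{n-6}_{m+2},
s^{n-9}_{m-2}, s^{n-9}_{m}, s^{n-9}_{m+2} \},
\end{multline*}
so that each pattern reaches $IK=2$ lattice sites to either side of $m$ and $JL=9$ time slices into the past. Assuming that no padding is applied at the edges of the grid (the boundary treatment is not specified here), a complete pattern exists only for $n>JL$ and $IK<m\le M-IK$.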

\begin{figure*}
\centering
\resizebox{\hsize}{!}{\includegraphics{NeuralNetwork.pdf}}
\caption{Forecasting method illustration. One constructs an embedding space using
space and time delays, then assembles randomly positioned grid input patterns within the training set to pass to the neural network (in this figure
we show 3 randomly selected input patterns).
The input is a
$(2 I+1)(J+1)$-dimensional vector ${\bf x}(s^n_m)$ and the target (output) used to train the network is the value $s^{n+1}_{m}$.
After training with a sequence
of patterns $p(i), p(i+1), p(i+2), \ldots$, the patterns adjacent to the forecast set are used to calculate the outputs
to compare against the forecast set. To forecast the $n+2$ slice we concatenate the previously predicted $n+1$ slice and proceed accordingly.}
\label{NeuralNetwork}
\end{figure*}

As for the back-propagation hyperparameters, we use an adaptive learning rate $\eta_n=\eta/(1+n/10000)$, where the parameter $\eta$ is the initial learning rate and $\eta_n$ is the learning rate used at step
$n$ of the stochastic gradient descent, together with a momentum $\alpha$ and a weight decay $\rho$.
In addition we use one hidden layer with $N_h$ nodes. A further hyperparameter is the choice of the activation function: we use either the ReLU or a sigmoid function depending on the
test case we are working with.
We also normalize the data before passing it through the neural network. In most cases we scale it in linear fashion,
$x \to x/\alpha_{nor} + \beta_{nor}$, and in the case of real physical data, as we will see later, we scale it in logarithmic fashion, $x \to \frac{\ln(1+x)}{\alpha_{nor}}+\beta_{nor}$, where $x$ is the initial data,
and $\alpha_{nor}$ and $\beta_{nor}$ are the arbitrary scaling and shift constants, respectively. For the weight (and bias)
initialization we choose random numbers with a uniform distribution on $[0,1]$, shifted by $\beta_{rng}$ and scaled
by $\alpha_{rng}$. The final hyperparameter is the number of steps taken in the stochastic gradient descent (equivalent to a mini-batch approach with $N_{\textnormal{batch}}=1$), which we denote by $N_{\textnormal{steps}}$. All of these hyperparameters are calibrated and fixed
before we do any simulations with respect to the parameters $I$, $J$, $K$, $L$, which are calibrated automatically by the methods described below,
derived from dynamical systems theory. In this sense $I$, $J$, $K$ and $L$ are not hyperparameters of the neural network.
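
To give a feel for the learning rate schedule above, it can be evaluated at a few representative steps:
\begin{equation*}
\eta_0=\eta,\qquad \eta_{10^4}=\frac{\eta}{2},\qquad \eta_{10^6}=\frac{\eta}{101}\approx 0.0099\,\eta ,
\end{equation*}
so that, for instance, an initial rate $\eta=0.3$ has decayed to about $0.003$ after $10^6$ steps. For orientation only, one common textbook form of the stochastic gradient update with momentum and weight decay is
\begin{equation*}
\Delta w_n = -\eta_n \left( \frac{\partial E_n}{\partial w} + \rho\, w_n \right) + \alpha\, \Delta w_{n-1}, \qquad w_{n+1}=w_n+\Delta w_n ,
\end{equation*}
where $E_n$ is the error on the pattern sampled at step $n$. This particular form, and the symbol $E_n$, are assumptions made for illustration; the exact update used may differ in detail.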


We then assess the goodness of fit, first by visual inspection and second by numerically calculating the so-called
structural similarity $\textnormal{SSIM}(x,y)$, which was proposed by \cite{Wang04imagequality} and has already been used in the context of spatial-temporal forecasting in \cite{covas2016, covaspeixinhojoao}. It has also been used in the context of deep learning for enhancing
the resolution of two dimensional images \cite{2015arXiv150100092D} and restoring missing data in images \cite{2018arXiv180208369Z}. For details on the SSIM measure see \cite{Wang04imagequality,2009ISPM...26...98W, 2012ITIP...21.1488B}.
The SSIM index is a metric used to quantify the perceived quality of digital images and videos. It
allows two images to be compared and provides a value of their similarity; a value of $\textnormal{SSIM}=1$ corresponds to the case of two
perfectly identical images. We use it by calculating the $\textnormal{SSIM}(x,y)$ between the entire test set and the predicted set.
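
For reference, a sketch of the single-window form of the index as defined in \cite{Wang04imagequality} is
\begin{equation*}
\textnormal{SSIM}(x,y)=\frac{(2\mu_x\mu_y+C_1)(2\sigma_{xy}+C_2)}{(\mu_x^2+\mu_y^2+C_1)(\sigma_x^2+\sigma_y^2+C_2)},
\end{equation*}
where $\mu_x$, $\mu_y$ are the means, $\sigma_x^2$, $\sigma_y^2$ the variances and $\sigma_{xy}$ the covariance of the two images, and $C_1=(k_1 D)^2$, $C_2=(k_2 D)^2$ are small stabilizing constants, with $D$ the dynamic range of the data (written $D$ here to avoid a clash with the temporal lag $L$) and $k_1=0.01$, $k_2=0.03$ the default values suggested in that reference. For $x=y$ the numerator and denominator coincide, giving $\textnormal{SSIM}=1$. In practice the index is often evaluated over local windows and then averaged; which variant and which constants are used is not fixed by this formula.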

Although most papers using neural networks for forecasting in time, or in space and time (ADD REFERENCES), use delays either implicitly
(convolutional ones?) or explicitly (TDNNs) (ADD REFERENCES), as far as the authors are aware there has been no firm evidence,
either theoretical or empirical, on how to choose the inputs for a neural network (see, however, \cite{1555956}, who show how
the forecasting error for a pure time series prediction changes with the delay and the number of time delay points used as an input).

Here we propose that the optimal temporal and spatial delays ($L$ and $K$, respectively) must be the ones based on the first minimum of the {\em mutual information}
\cite{Fraser86, abarbanel1997analysis, opac-b1092652} and that the optimal number of temporal and spatial points to use
($J$ and $I$, respectively) must be the
ones based on
the {\em method of false nearest neighbours} \cite{1992PhRvA..45.3403K, 1992PhRvA..45.7058M, 1993RvMP...65.1331A, 1996PhT....49k..86A, abarbanel1997analysis}. We conjecture that as an architecture approaches this optimal
architecture, $\textnormal{SSIM} \to 1$. In the case of finite and/or noisy training sets, $\textnormal{SSIM} \to x<1$, where $x$ is the best
forecast possible given the data set. Visually, we believe that a plot of the SSIM versus some reasonable metric constructed to
represent the distance between any architecture and the optimal
architecture will show a skewed bell shape, as depicted in Fig.\ \ref{conjecture}. In this conjecture we use
the most obvious candidate to represent the distance between any architecture and the optimal
architecture, the Euclidean distance between the four architecture parameters, $d_e=\sqrt{(I-I^*)^2+(J-J^*)^2+(K-K^*)^2+(L-L^*)^2}$, where the starred quantities denote the optimal values.
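
As a reminder of what the first criterion measures, a sketch of the standard histogram-based average mutual information along the time direction, in the spirit of \cite{Fraser86}, is
\begin{equation*}
\mathcal{I}(L)=\sum_{a,b} p_L(a,b)\,\log_2\frac{p_L(a,b)}{p(a)\,p(b)},
\end{equation*}
where $p(a)$ is the probability that a state $s^n_m$ falls in bin $a$ of a histogram of the data and $p_L(a,b)$ is the joint probability that $s^n_m$ falls in bin $a$ while $s^{n+L}_m$ falls in bin $b$. The optimal delay $L^*$ is then taken as the first minimum of $\mathcal{I}(L)$, and $K^*$ follows in the same way with the shift applied to the spatial index $m$. The symbol $\mathcal{I}$ and the binning are notational assumptions made here; the estimator actually used may differ.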


\begin{figure}
\resizebox{\hsize}{!}{\includegraphics{conjecture-crop.pdf}}
\caption{Our main conjecture. For an infinite noiseless training set, $\textnormal{SSIM} \to 1$ as the architecture approaches the optimal one. For real data sets,
there is a dispersion of the SSIM versus some reasonable metric constructed to
represent the distance between any architecture and the optimal one (e.g.\ $d_e=\sqrt{(I-I^*)^2+(J-J^*)^2+(K-K^*)^2+(L-L^*)^2}$).
}
\label{conjecture}
\end{figure}


% needed in second column of first page if using \IEEEpubid
%\IEEEpubidadjcol

\section{Monte Carlo results}

In order to empirically substantiate our conjecture we take four examples of spatial-temporal series and attempt
to forecast them using our feedforward neural network. First we calculate the optimal temporal and spatial delays ($L^*$ and $K^*$, respectively)
using the first minimum of the mutual information, and the optimal number of temporal and spatial points to use
($J^*$ and $I^*$, respectively) using the method of false nearest neighbours.
We then run a Monte Carlo simulation on each one of our four examples, randomly sampling values
of $I$, $J$, $K$, $L$ and calculating the values $d_e(I,J,K,L)$ and $\textnormal{SSIM}(I,J,K,L)$.
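
For example, with the optimal values $(I^*,J^*,K^*,L^*)=(2,6,9,70)$ found for the sunspot data below, a hypothetical sampled architecture $(I,J,K,L)=(3,4,9,66)$ lies at
\begin{equation*}
d_e=\sqrt{(3-2)^2+(4-6)^2+(9-9)^2+(66-70)^2}=\sqrt{21}\approx 4.58 .
\end{equation*}
Note that $d_e$ weights all four parameters equally, so a parameter sampled over a wide range (here the temporal lag $L$) can dominate the distance.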

We first take a physical system, a real data example, and then we progress from ``simpler'' systems (maps)
capable of generating spatial-temporal chaos
to more ``complex'' systems (Ordinary Differential Equations, ODEs) and to really ``complex'' systems (Partial
Differential Equations, PDEs). This is partially motivated by results in the literature that show
that general universalities are present in different levels of simplification of physical models, from the original PDEs
to truncated ODE expansions (e.g.\ Galerkin expansions \cite{2001Chaos..11..404C}) to the most extreme simplifications such
as maps which capture the essence of the problem. In all cases we take examples with one spatial and one temporal dimension. However we believe
that our conjecture will extend to systems with multiple spatial dimensions and one temporal dimension.

\subsection{Sunspot data - a physical system example}

% code and results in
% C:\Users\eurico\SunspotAnalysis\NeuralNetworks\SolarForecastingNeuralNetworks41.xlsm

The first example we take is a physical real data example based on a previous paper by one of us \cite{covaspeixinhojoao}, where a
neural network using the type of architecture above (Fig.\ \ref{architecture}) was used to forecast sunspot areas $A(t,\theta)$ on the
Sun in both space ($\theta$ latitude) and time (Carrington Rotation index). We take as a ``training set'' the data from the year 1874 to
approximately 1997 (i.e.\ the first 1646 Carrington Rotations). We then attempt to reproduce or forecast the sunspot area butterfly
diagram from Carrington Rotation 1921 to 2162 (the last one corresponding approximately to the year 2015); that is, we use 1646 time
slices (\textasciitilde 122.92 years) to reproduce the next 242 time slices (\textasciitilde 18.07 years). The training set
corresponds to around 12 solar cycles (cycle 11 to 22), while the ``forecasting set'' (or validation set) equates to around 1.5 cycles
(cycle 23 and half of cycle 24). The entire dataset, including the training and forecasting sets, is a grid $x^i_j=x(i,j)$, with
$i=1,\ldots,1888$ and $j=1,\ldots,50$. The training set is the first $1646\times 50$ block of this grid. For this case the optimal values were $I^*=2$, $J^*=6$, $K^*=9$ and
$L^*=70$.
The hyperparameters of the neural network were:
$N_h=70$, $\eta=0.3$, $\alpha=0.01$, $\rho=0$, a logarithmic normalization of the inputs
with $\alpha_{nor} = 10$ and $\beta_{nor} = 0$, weight initialization with $\alpha_{rng} = 10^{-2}$ and $\beta_{rng} = -0.5$
and $N_{\textnormal{steps}}=10^6$. We used the logistic sigmoid function as the activation on both the hidden and output layers.
The results are depicted in Fig.\ \ref{MonteCarloSSIMversusParameterMetricDistance}. They show a dispersion as conjectured and
a convergence to the highest $\textnormal{SSIM}$ value we could obtain for this particular slicing of the training and test sets,
$\textnormal{SSIM}= 0.836876152$.

\begin{figure}[!ht]
\centering
\resizebox{\hsize}{!}{\includegraphics[]{MonteCarloSSIMversusParameterMetricDistance-crop.pdf}}
\caption{Monte Carlo simulation of different architectures of the input layer for the neural network forecast for the sunspot data.
It shows the structural similarity (SSIM) against how far (in a Euclidean metric) the parameters of a particular
run were from the supposedly optimal architecture parameters (red dot).}
\label{MonteCarloSSIMversusParameterMetricDistance}
\end{figure}




\subsection{Coupled H\'{e}non maps - a discrete-time dynamical system}

% code and results in
% C:\Users\eurico\SunspotAnalysis\SpatialTemporalFeatureSelectionOptimalApproach\KuramotoSivashinsky27_HenonMaps.xlsm

Motivated by the real case from a physical system, we then investigated whether this same conjecture
holds in a very simplified example of a spatial-temporal model. Coupled maps are widely used as models of spatial-temporal
chaos and pattern/structure formation \cite{1989PThPS..99..263K,1989JSP....54.1489M,9780471937418}.

Following \cite{2000PhRvL..84.1890P,Parlitz2000NonlinearPO} we then take a lattice of $M=100$
coupled H\'{e}non maps:
\begin{IEEEeqnarray}{lCr}
\label{henon}
u_m^{n+1} = 1 - 1.45 \left[ \frac{1}{2} u^n_m + \frac{u_{m-1}^{n}+u_{m+1}^{n}}{4} \right]^2 + 0.3 v_m^n, &&\\
v_m^{n+1} = u_m^n.&& \nonumber
\end{IEEEeqnarray}
with fixed boundary conditions $u^n_1=u_M^n=\frac{1}{2}$ and $v_1^n=v_M^n=0$.
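
As a quick consistency check on Eq.\ (\ref{henon}), consider a spatially homogeneous state away from the boundaries, $u^n_{m-1}=u^n_m=u^n_{m+1}\equiv u^n$ and $v^n_m\equiv v^n$. The coupling term then collapses,
\begin{equation*}
\frac{1}{2}u^n+\frac{u^n+u^n}{4}=u^n ,
\end{equation*}
and each site follows the single H\'{e}non map $u^{n+1}=1-1.45\,(u^n)^2+0.3\,v^n$, $v^{n+1}=u^n$. The spatial structure of the lattice therefore comes entirely from the nearest neighbour averaging inside the square bracket (and from the fixed boundaries).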

We ran the simulation for $N=531$ time steps, and divided the set into $N_{\textnormal{train}}=500$ time steps for the training set
and $N_{\textnormal{test}}=31$ time steps for the test set. The other parameters of the neural network were:
$N_h=10$, $\eta=0.1$, $\alpha=0$, $\rho=0$, a linear input normalization with $\alpha_{nor} = 2.947992$ and $\beta_{nor} = 0.515$, weight initialization with $\alpha_{rng} = 10^{-3}$ and $\beta_{rng} = -0.5$, and $N_{\textnormal{steps}}=10^6$. We used the ReLU function as the activation on both the hidden and output layers.

For this case the optimal values were $I^*=1$, $J^*=3$, $K^*=2$ and $L^*=3$. The results
are depicted in Fig.\ \ref{MonteCarloSSIMversusParameterMetricDistance100HenonCoupledMaps}. They show a dispersion as conjectured and a convergence to the highest $\textnormal{SSIM}$ value we could obtain for this particular slicing of the training and test sets, $\textnormal{SSIM}=
0.71139101$.


\begin{figure}[!ht]
\centering
\resizebox{\hsize}{!}{\includegraphics[]{MonteCarloSSIMversusParameterMetricDistance100HenonCoupledMaps-crop.pdf}}
\caption{Monte Carlo simulation of different architectures of the input layer for the neural network forecast for a series of 100 coupled
H\'{e}non maps.
It shows the structural similarity (SSIM) against how far (in a Euclidean metric) the parameters of a particular
run were from the supposedly optimal architecture parameters (red dot). The green line (trendline) seems to show that as the parameters
of a randomly chosen architecture get close to the supposedly optimal ones, the SSIM converges to what seems to be the
best possible forecast value given the limited dataset.}
\label{MonteCarloSSIMversusParameterMetricDistance100HenonCoupledMaps}
\end{figure}

The results seem to suggest the same structure as depicted in our conjecture diagram and in the previous results for sunspots.

\subsection{Coupled Ordinary Differential Equations - Lorenz-96 model}

% code in
% C:\Users\eurico\SunspotAnalysis\SpatialTemporalFeatureSelectionOptimalApproach\lorenz4D\lorenz4Drun.m
%%% the Lorenz model is: (cyclical)
% dX[j]/dt=(X[j+1]-X[j-2])*X[j-1]-X[j]+F
%J=40;               %the number of variables
%h=0.05;             %the time step
% results in
% C:\Users\eurico\SunspotAnalysis\SpatialTemporalFeatureSelectionOptimalApproach\KuramotoSivashinsky23_Lorenz.xlsm

For the spatially extended ODE model we used a well known system of 40 coupled ODEs proposed by Edward Lorenz in 1996 \cite{articleLorenz96}:

\begin{equation}
\label{lorenz96equations}
\frac{d\,x_j}{d\,t}=\left( x_{j+1} - x_{j-2} \right ) x_{j-1} - x_j + F, \quad j=1, \ldots, N=40,
\end{equation}
where $x_{-1}=x_{N-1}$, $x_0=x_N$ and $x_{N+1}=x_1$. We use the forcing $F=5$ to get some interesting behaviour in space and time. We used a time step $\Delta t=0.05$ and integrated this equation using Matlab (ADD REFERENCE).
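
Two elementary properties of Eq.\ (\ref{lorenz96equations}) are worth keeping in mind. First, the uniform state $x_j=F$ for all $j$ is a steady state, since
\begin{equation*}
(F-F)\,F-F+F=0 .
\end{equation*}
Second, the quadratic term does not change the total ``energy'' $E=\frac{1}{2}\sum_j x_j^2$ (a symbol introduced here only for this remark): using the cyclic boundary conditions, $\sum_j x_j x_{j+1}x_{j-1}=\sum_j x_j x_{j-1}x_{j-2}$ (shift $j\to j+1$ in the second sum), so that
\begin{equation*}
\frac{dE}{dt}=-\sum_{j=1}^{N} x_j^2+F\sum_{j=1}^{N} x_j .
\end{equation*}
The dynamics thus result from the balance between forcing, damping and this energy-preserving coupling.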

\begin{figure}[!ht]
\centering
\resizebox{\hsize}{!}{\includegraphics[]{MonteCarloSSIMversusParameterMetricDistanceLorenz96-crop.pdf}}
\caption{Monte Carlo simulation of different architectures of the input layer for the neural network forecast for the 40-ODE Lorenz 96 system.
It shows the structural similarity (SSIM) against how far (in a Euclidean metric) the parameters of a particular
run were from the supposedly optimal architecture parameters (red dot). The green line (trendline) seems to show that as the parameters
of a randomly chosen architecture get close to the supposedly optimal ones, the SSIM converges to what seems to be the
best possible forecast value given the limited (and noisy) dataset.}
% https://en.wikipedia.org/wiki/Lorenz_96_model
\label{MonteCarloSSIMversusParameterMetricDistanceLorenz96}
\end{figure}

We ran the simulation for $N=531$ time steps, and divided the set into $N_{\textnormal{train}}=500$ time steps for the training set
and $N_{\textnormal{test}}=31$ time steps for the test set. The other parameters of the neural network were:
$N_h=10$, $\eta=0.05$, $\alpha=0.001$, $\rho=0$, a linear input normalization with $\alpha_{nor} = 10$ and $\beta_{nor} = 0.430$, weight initialization with $\alpha_{rng} = 10^{-3}$ and $\beta_{rng} = -0.5$, and $N_{\textnormal{steps}}=10^5$. We used the ReLU function as the activation on both the hidden and output layers.

For this case the optimal values obtained before the Monte Carlo simulation were $I^*=2$, $J^*=2$, $K^*=1$ and $L^*=9$. The results
are depicted in Fig.\ \ref{MonteCarloSSIMversusParameterMetricDistanceLorenz96}. They show a dispersion as conjectured and a convergence to the highest $\textnormal{SSIM}$ value we could obtain for this particular slicing of the training and test sets, $\textnormal{SSIM}=
0.861844038$. The results seem to suggest the same structure as depicted in our conjecture diagram and in the previous results for sunspots
and the coupled H\'{e}non maps.

\subsection{Partial Differential Equations - Kuramoto-Sivashinsky model}

% code in
% C:\Users\eurico\SunspotAnalysis\SpatialTemporalFeatureSelectionOptimalApproach\KSEproject\matlabKSintrinsicSolver.m
% from http://chaosbook.org/extras/KSEproject/html/index.html
% results in
% C:\Users\eurico\SunspotAnalysis\SpatialTemporalFeatureSelectionOptimalApproach\KuramotoSivashinsky28.xlsm

Finally we take a full PDE system, the Kuramoto-Sivashinsky model \cite{1976PThPh..55..356K,1977AcAau...4.1177S},
a very well known system capable of spatiotemporal chaos
and complex spatial-temporal dynamics. It is a fourth-order nonlinear PDE introduced in the 1970s by Yoshiki
Kuramoto and Gregory Sivashinsky to model the diffusive instabilities
in a laminar flame front.

The model is described by the following equation:

\begin{equation}
\label{kuramotoSivashinskyequation}
\frac{\partial u(x,t)}{\partial t} = -\frac{\partial^4 u(x,t)}{\partial x^4}-\frac{\partial^2 u(x,t)}{\partial x^2}-u(x,t)
\frac{\partial u(x,t)}{\partial x},
\end{equation}
where $x \in [-\frac{L}{2},+\frac{L}{2}]$. We have integrated this equation using Matlab (ADD REFERENCE)
using a time step of $\Delta t=0.5$ and $L=22$ Fourier modes.

\begin{figure}[!ht]
\centering
\resizebox{\hsize}{!}{\includegraphics[]{MonteCarloSSIMversusParameterMetricDistanceKS_L=22-crop.pdf}}
\caption{Monte Carlo simulation of different architectures of the input layer for the neural network forecast for
the Kuramoto-Sivashinsky system with $L=22$.
It shows the structural similarity (SSIM) against how far (in a Euclidean metric) the parameters of a particular
run were from the supposedly optimal architecture parameters (red dot). The green line (trendline) seems to show that as the parameters
of a randomly chosen architecture get close to the supposedly optimal ones, the SSIM converges to what seems to be the
best possible forecast value given the limited (and noisy) dataset.}
\label{MonteCarloSSIMversusParameterMetricDistanceKS_L=22}
\end{figure}

We ran the simulation for $N=531$ time steps, and divided the set into $N_{\textnormal{train}}=500$ time steps for the training set
and $N_{\textnormal{test}}=31$ time steps for the test set. The other parameters of the neural network were:
$N_h=50$, $\eta=0.1$, $\alpha=0$, $\rho=0$, a linear input normalization with
$\alpha_{nor} = 5.8472$ and $\beta_{nor} =0.5$, weight initialization
with $\alpha_{rng} = 10^{-3}$ and $\beta_{rng} = -0.5$, and $N_{\textnormal{steps}}=10^6$.
We used the ReLU function as the activation on both the hidden and output layers.