Maritime Data Mining for Marine Safety Based on Deep Learning: Southern Vietnam Case Study Rudarenje podataka za pomorsku sigurnost na temelju dubokog učenja: studija slučaja Južnog Vijetnama


Abstract
High-speed passenger vessels, integrated river and sea vessels, container vessels, oil tankers, and other underwater vehicles operating in maritime traffic are among the types of vessels that must be equipped with AIS and VHF. The safety of navigation is one of the major problems in the maritime sector, particularly in Vietnam. Furthermore, marine traffic in the seaport zone is a common and difficult issue to manage in areas with a high volume of vessel traffic, mostly in places where the infrastructure supporting navigation is inadequately developed to meet the rapidly growing demands of the contemporary world. Therefore, it is necessary to create an integrated maritime management system to improve the efficiency of data exploitation and support maritime safety. To address this challenge, this study suggests a Maritime Traffic State Prediction (MTSP) model to predict traffic conditions in channels where real-time data collection is insufficient in some specific locations. We recommend a deep learning method using Long Short-Term Memory (LSTM) networks to predict the safe path of the vessel in case of missing data segments. The findings show that the proposed approach encourages the mining of historical vessel data for maritime traffic, is ready to be applied, and can easily be implemented in a computer program or a web-based app.

INTRODUCTION / Uvod
Vessel performance, the caliber of the crew, the environment, management issues, and other factors all affect the complicated system of waterways in Vietnam. Waterway accidents can cause significant financial losses, casualties, and environmental damage. It is therefore essential to evaluate how safe the waterway traffic is. High-speed passenger vessels, integrated river and ocean-going vessels, container ships, and oil tankers are among the vessels that must now have AIS and VHF on board [1][2]. One of the main issues for the shipping industry, as well as for the security and economy of the entire globe, and Vietnam in particular, is vessel navigation. Some critical points and buoys in navigational channels are tagged with virtual AIS signals for easy identification in crowded regions, i.e. in locations with high vessel traffic, particularly in places where navigation infrastructure is not adequately developed to fulfill the demands of the sea. Maritime traffic is a real challenge to handle as demand increases. The system for effectively gathering, integrating, and analyzing data relates to marine navigation. By using historical data mining approaches [3][4], the fundamental issues of anticipating vessel traffic situations along navigational channels were resolved. A prototype system is validated with the suggested fixes. The experimental findings demonstrate the viability and efficacy of the suggested techniques and their use in practice. The recommended methods use historical AIS data sources [5].
Vessels are becoming complex sensor platforms as the digital revolution of the maritime industry continues to grow to support more energy-efficient marine and vessel operations and to meet the challenges of new legislation. The combination of modern communication systems and advanced sensor technologies results in significantly improved vessel connectivity, which allows for the collection and analysis of a large amount of operational data. Specifically, the synchronization and analysis of data from various sources will speed up decision-making for operators and improve vessel performance management in critical areas including energy and fuel management, emissions control, machinery and equipment monitoring, and route optimization. Thus, data mining will benefit the shipping sector by providing fresh insights and added value to improve decision-making, asset tracking, and fleet-wide optimal application, which is the main purpose of this study.
With the aim of presenting a framework for analyzing historical vessel data in order to predict traffic conditions in channels where real-time data collection is insufficient in certain areas, the main contributions of this work are as follows:
i. Presenting a solution to collect, integrate, and analyze data related to maritime traffic, and then estimate the traffic status. The proposal is suitable for each navigational channel, thus increasing the accuracy and usefulness of the management information.
ii. Suggesting a Maritime Traffic State Prediction (MTSP) model that aims to help improve management systems and applications. We extract vessel kinematic information, identified by MMSI vessel codes, from historical AIS maritime traffic data.
iii. Proposing the MTSP model based on Long Short-Term Memory (LSTM) networks to predict the route of vessels in navigational channels in case of real-time data collection failure or discrete data loss.
The evaluation results, which used the developed prototype and the collected data sources, were thoroughly analyzed to confirm the feasibility and effectiveness of the proposed methods. The rest of the paper is organized as follows: Section 2 gives an overview of the relevant contents; Section 3 covers the related knowledge and the problem formulation with the MTSP model based on the LSTM network; Section 4 describes the results and evaluation of the testing process. Finally, conclusions and a potential direction for future research are presented in Section 5.

Overview of maritime data mining connected to classification and analysis algorithms / Pregled rudarenja podataka u pomorstvu povezanih s algoritmima klasifikacije i analize
In maritime big data management, marine data is categorised into one or more given classes using classification models. These models are trained on a historical dataset with labels [6]. Classification is the process of assigning labels to data items. The goal of the classification task is to identify a model that determines the class to which new data belongs. In this section, we examine some modern analysis and application methods for processing maritime big data [6][7][8][9][10][11][12][13][14][15][16][17][18][19] as follows:
- Decision Tree Algorithm: A decision tree is a hierarchical graph used in classification. It is based on a series of rules and is a popular tool for data mining and classification, e.g. fuzzy-rough decision trees used to learn about the behavior of vessel types [6]. The decision tree has the following classification model: first, each internal node tests an attribute; second, each leaf node carries a class label; and finally, each branch from an internal node represents the result of the test on the corresponding attribute.
- Naive Bayes Algorithm: Bayes' theorem calculates the probability of the occurrence of a random event A given that the related event B has occurred. Naive Bayes Classification (NBC) is an algorithm based on probability calculation applying Bayes' theorem. This algorithm belongs to the supervised learning group; an example is the use of Bayesian networks on AIS data [7].
- Support Vector Machine Algorithm: Support Vector Machine (SVM) is a supervised learning algorithm used to divide classification data into separate groups [8][9]. Imagine a dataset consisting of blue and red points placed on the same plane. If the two groups are linearly separable, a straight line divides them. For more complex datasets that no straight line can divide, the data set is mapped into a higher-dimensional space (n dimensions), where a separating hyperplane can be found [10][11]. Here, the SVM algorithm is only introduced and is not discussed in detail.
- Random Forest Algorithm: Random Forest (RF) is an ensemble model. The Random Forest model is very effective for classification problems because it mobilizes hundreds of smaller internal models with different rules at the same time to make the final decision [12]. The basic unit of RF is the decision tree, of which there are hundreds. Each decision tree is randomly generated by resampling (bootstrap, random sampling) and uses only a small random subset of features from all the variables in the data. In its final state, the RF model usually works very accurately, but in return its internal working mechanism is difficult to interpret because the structure is too complicated.
- Deep Learning Algorithm: The Long Short-Term Memory (LSTM) network consists of memory blocks, each containing a cell state and three gates [18][19]: the input gate (controls how the input can change the cell state), the output gate (sets which part of the cell state to output), and the forget gate (decides how much memory to keep).
Remark 1: Maritime data is collected from many different sources and does not have integrated links. Therefore, it is necessary to develop an integrated management system to improve the efficiency of data exploitation to support maritime safety, such as predicting the possibility of collisions and monitoring vessel mooring wharves.

Specific time-series data based on AIS / Specifični AIS temeljen na vremenskim serijama podataka
In particular, the vessel kinematic information, including latitude (lat), longitude (lon), speed over ground (SOG), and course over ground (COG), plays a critical role in evaluating optimal navigation routes. Predicting the future path of a vessel from time-series data requires analyzing an array of relevant historical AIS data [13]. The kinematic state is denoted by equations (1), (2), and (3) [14][15]:
(1)
The vessel's historical path (the original AIS data for the vessel with MMSI code 525100764, provided by Southern Vietnam Maritime Safety Corporation), which is given in Table 1, is represented by a sequence of observation points $\{X_{t_0}, X_{t_1}, \ldots, X_t\}$, where $t_i < t_j$ if $i < j$. It is therefore necessary to use equally sampled observations to obtain a series of length $T + 1$ as follows:
(2)
Encoding complicated vessel motion data in this feature space poses a significant challenge. The solution used is therefore to expand the feature space to a higher dimension. The "four-hot" representation vector is used to discretize the lat, lon, SOG, and COG data into $N_{lat}$, $N_{lon}$, $N_{SOG}$, and $N_{COG}$ bins [16], respectively. The vector $h_t$ is expressed by
(3)
Remark 2: Depending on the weather and traffic, vessels traveling along comparable routes will show different features. For vessels with large inertia and complex propulsion systems, it is necessary to predict the safe routes.
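The "four-hot" encoding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the value ranges and bin counts in `FEATURES` and the helper names `bin_index` and `four_hot` are hypothetical, chosen only to show how one one-hot segment per attribute is concatenated into a single vector with exactly four active entries.

```python
# Hypothetical ranges and bin counts (N_lat, N_lon, N_SOG, N_COG);
# the study's actual resolutions are not given in the text.
FEATURES = [
    ("lat", 10.0, 11.0, 100),
    ("lon", 106.5, 107.5, 100),
    ("sog", 0.0, 30.0, 30),
    ("cog", 0.0, 360.0, 72),
]


def bin_index(value, low, high, n_bins):
    """Map a value in [low, high] to a bin index in 0..n_bins-1."""
    if not low <= value <= high:
        raise ValueError(f"value {value} outside [{low}, {high}]")
    fraction = (value - low) / (high - low)
    # The upper bound itself falls into the last bin.
    return min(int(fraction * n_bins), n_bins - 1)


def four_hot(point):
    """Concatenate one one-hot segment per attribute into a single vector."""
    vector = []
    for name, low, high, n_bins in FEATURES:
        segment = [0] * n_bins
        segment[bin_index(point[name], low, high, n_bins)] = 1
        vector.extend(segment)
    return vector


# Illustrative AIS point (made-up values).
h = four_hot({"lat": 10.35, "lon": 107.05, "sog": 12.4, "cog": 185.0})
print(len(h), sum(h))
```

The resulting vector has length $N_{lat} + N_{lon} + N_{SOG} + N_{COG}$ and contains exactly four ones, one per attribute.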

Dynamic visualization of the vessel movement tracks in Vungtau port / Dinamička vizualizacija putanja kretanja plovila u luci Vungtau
Nowadays, vessels are becoming complex sensor platforms as the maritime industry's digital revolution gathers momentum to support more energy-efficient marine and vessel operations and to handle the challenges of new legislation. The combination of modern communication systems and advanced sensor technologies results in significantly improved vessel connectivity, which allows for the collection and analysis of a large amount of operational data. Specifically, the synchronization and analysis of data from various sources will speed up decision-making for operators and improve vessel performance management in critical areas including energy and fuel management, emissions control, machinery and equipment monitoring [17], and route optimization. Thus, data mining will benefit the shipping sector by providing fresh insights and added value to support improved decision-making, asset tracking, predicting, and fleet-wide optimal application.
The methodology used in this study focuses on mining maritime traffic from historical vessel data and consists of two stages: data collection and classification, together with the required measurement metrics, in the first stage; and analysis and prediction of marine traffic states using tools or algorithms in the second stage. Maritime traffic-related data is collected from various sources in existing fixed monitoring systems. Based on the collection type, there are three types of data: static data, dynamic data, and auxiliary data. One of the data samples used in this paper is described in Table 2. For clarity, we used sample data of dense maps visualized for July 27, 2019, and Fig. 2 shows the dynamic visualization of the vessel movement in Vietnam's southern region. The more data is collected, the greater the chances that the system will estimate traffic conditions in a timely and accurate manner. To be more precise, we use the vessel's dynamic visualization to estimate traffic circumstances almost in real time. This allows us to provide suitable models for managing and predicting maritime traffic conditions, even when data segments are missing.
Visualizing the initially collected AIS data gives the authors an overview of the collected dataset. Consequently, the data preprocessing avoids missing data, which would otherwise lead to the loss of crucial features of the dataset. Based on the data visualization, the authors determine the coordinate area in order to extract suitable data for the evaluation process. In addition, data fields (such as ship name, call sign, and band) that do not affect the goals of training the prediction model are removed in order to increase processing speed. In this study, the visualization data array focuses on vessels with continuous paths and docking in the coastal area of Vung Tau City, with the vessel type, the MMSI identifier, and the vessel dynamic data (lat, lon, SOG, and COG) as features.
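The preprocessing described above (restricting the data to a coordinate area and dropping fields that do not contribute to training) could look roughly like the sketch below. The bounding box, the record layout, and the function name `preprocess` are assumptions made for illustration; the sample records are invented and do not come from the study's dataset.

```python
# Fields kept for training; everything else (ship name, call sign, ...) is dropped.
KEEP_FIELDS = ("mmsi", "t", "lat", "lon", "sog", "cog")

# Hypothetical bounding box, not the study's actual coordinate area.
BBOX = {"lat": (10.0, 10.6), "lon": (106.9, 107.3)}


def preprocess(records, bbox=BBOX):
    """Filter records to the bounding box, drop unused fields, group by MMSI."""
    tracks = {}
    for rec in records:
        lat_ok = bbox["lat"][0] <= rec["lat"] <= bbox["lat"][1]
        lon_ok = bbox["lon"][0] <= rec["lon"] <= bbox["lon"][1]
        if not (lat_ok and lon_ok):
            continue
        slim = {name: rec[name] for name in KEEP_FIELDS}
        tracks.setdefault(slim["mmsi"], []).append(slim)
    # Each track must be ordered in time before it can be used as a sequence.
    for track in tracks.values():
        track.sort(key=lambda r: r["t"])
    return tracks


# Invented sample records for illustration only.
raw = [
    {"mmsi": 1, "t": 20, "lat": 10.35, "lon": 107.05, "sog": 9.8,
     "cog": 181.0, "name": "DEMO A", "call_sign": "XX1"},
    {"mmsi": 1, "t": 10, "lat": 10.36, "lon": 107.05, "sog": 9.5,
     "cog": 180.0, "name": "DEMO A", "call_sign": "XX1"},
    {"mmsi": 2, "t": 10, "lat": 12.00, "lon": 109.00, "sog": 14.0,
     "cog": 45.0, "name": "DEMO B", "call_sign": "XX2"},  # outside the box
]
tracks = preprocess(raw)
print(tracks)
```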
The labeled data, which express the Fairway Maritime Traffic (FM-Traffic), may be used for prediction models in data mining techniques. The model assesses the FM-Traffic and channel conditions where time-series data is not available, based on previous vessel data. Given the generic structure (Fig. 1), we need to establish the most important data to efficiently train the model. One of the problems here is that the traffic in each channel differs and changes regularly. For the spatial path, we divide the route network into channels based on the ENC, where each channel is short enough that the variance of traffic conditions within it can be ignored. In terms of time, we divide the time into time frames, within which the collected data is integrated and analyzed. With this data separation method, the model is lightweight and yet practical. The basic format is simple enough that the data can be collected and integrated with any device that uses VHF frequencies, as presented in our previous work [25].
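The division into channels and time frames can be illustrated with a small aggregation sketch. It assumes each record already carries a channel identifier (the ENC-based assignment is not shown), uses a hypothetical 10-minute frame length, and summarizes each (channel, time frame) cell by its mean SOG; none of these choices is taken from the paper.

```python
from collections import defaultdict

FRAME_SECONDS = 600  # hypothetical 10-minute time frame


def mean_speed_by_cell(records, frame_seconds=FRAME_SECONDS):
    """Average SOG for every (channel, time-frame index) cell."""
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [sum of SOG, count]
    for rec in records:
        cell = (rec["channel"], int(rec["t"] // frame_seconds))
        sums[cell][0] += rec["sog"]
        sums[cell][1] += 1
    return {cell: total / count for cell, (total, count) in sums.items()}


# Invented records: "t" is seconds since the start of the observation window.
records = [
    {"channel": "C1", "t": 30, "sog": 10.0},
    {"channel": "C1", "t": 500, "sog": 12.0},
    {"channel": "C1", "t": 650, "sog": 8.0},
]
cells = mean_speed_by_cell(records)
print(cells)
```

Cells with no records simply do not appear in the result, which is where a prediction model would have to fill in the missing traffic state.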

Management integration system / Integracijski sustav upravljanja
The management system consists of four main components, namely the API server, the application, the computing server, and the database server, as shown in Fig. 3.
API server: The core component of the system is the API server, which processes user requests. The server is built on Node.js, an open-source, cross-platform JavaScript runtime. When users send a request to the application, the data is retrieved from the database, processed, and then sent back to the users. The request life cycle includes receiving and identifying user requests, validating data, and processing requests.
Computing server: The computing server processes the data submitted by users or the AIS. It performs calculations to return the speed corresponding to the vessel's navigation status on the application map. At the end of each cycle, the computing server updates the user's speed and reputation score. If there is not enough data to calculate the speed for some routes, the computing server refers to other data sources from resource APIs to ensure that the vessel's navigational status is always fully displayed. The directory structure of the computing server is almost identical to that of the API server. The computing server only computes and stores data, while the API server defines the endpoints.
Database server: The database server stores data from the AIS identification information system and the data processed and computed by the system. As this is a real-time system with big data, the database must have the following features: easy access to vast amounts of information; geospatial support for the navigation system; and confidentiality to protect the vessel's data.
Application: The application runs on an online platform and handles communication with users. It allows users to view the status of vessels. Additionally, the application is responsible for determining the vessel's route and collecting data for communication with the API server.
The system has a modular architecture with great flexibility, which enables it to scale with additional sensor stations (AIS, HMIS, etc.) and working positions (operation desk, training desk). At the same time, the system provides information-sharing interfaces with traffic management centers. Moreover, the system provides features suited to each task defined by the primary purpose of the user, including the AIS subsystem, VHF, MIS, ENC, and hydrometeorological data [24][25]. In conclusion, the general description of the system components demonstrates the relationship between the system utilization and each function, but in this study we focus on improving the application to support maritime safety.

Maritime traffic state prediction model based on LSTM network / Model predviđanja stanja pomorskog prometa temeljen na LSTM mreži
Figure 3 Structure of the management system Slika 3. Struktura sustava upravljanja
In the past ten years, concerns about maritime traffic safety and security have become evident due to the difficulties created by the increasing demand for additional vessels with greater capacity and velocity. To ensure navigational safety, prevent collisions, and improve the effectiveness of vessel management, predicting the trajectory of vessels is essential. A relatively recent development for complex geographic applications is the addition of effective machine learning technologies to accurately predict trajectories. However, the complexity of the maritime environment and issues with data quality, particularly in the Vietnam Sea Port, which has a high density of vessels, hinder reliable vessel trajectory predictions. With the system structure shown in Figure 3, the input data is processed and analyzed numerically, stored in Resource APIs, and then aggregated and filtered, with the data fields being separated. Subsequently, the four data fields (lat, lon, SOG, COG) selected for the prediction model become the input data of the Maritime Traffic State Prediction model [25]. In addition, the MTSP model has not yet been implemented in any maritime management system in Vietnam, which motivates us to propose a solution based on deep learning. In this study, we suggest the MTSP model based on the LSTM network (one of the deep learning algorithms mentioned above) to evaluate suitable paths for vessels along routes, with experiments based on data collected from the AIS system through Resource APIs provided by Southern Vietnam Maritime Safety Corporation. The proposed model used for analyzing maritime traffic data has the following characteristics:
- Highly accurate dynamic data analysis results due to direct processing in time-series format with quick feature extraction;
- Standardized time-series data sets from maritime traffic data collection systems facilitate the development of MTSP models;
- The LSTM algorithm has a three-gate structure that enables the processing of multi-layered data feedback. This allows the algorithm to extract deeper data features than the normal RNN algorithm [26].
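To make the three-gate structure concrete, the following sketch computes a single LSTM time step in plain Python using the standard textbook formulation. It is not the authors' code: the parameter layout (`params` mapping each gate name to an input matrix, a hidden matrix, and a bias) and the function names are assumptions chosen for readability, and a real model would use a deep learning library instead.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(w_in, x, w_hid, h, bias):
    """Return w_in @ x + w_hid @ h + bias for list-of-lists matrices."""
    return [
        sum(w * xi for w, xi in zip(row_in, x))
        + sum(u * hi for u, hi in zip(row_hid, h))
        + b
        for row_in, row_hid, b in zip(w_in, w_hid, bias)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps 'i', 'f', 'o', 'g' to (w_in, w_hid, bias)."""
    pre = {name: affine(params[name][0], x, params[name][1], h_prev, params[name][2])
           for name in ("i", "f", "o", "g")}
    i = [sigmoid(v) for v in pre["i"]]    # input gate: how much new input enters
    f = [sigmoid(v) for v in pre["f"]]    # forget gate: how much memory is kept
    o = [sigmoid(v) for v in pre["o"]]    # output gate: which part is exposed
    g = [math.tanh(v) for v in pre["g"]]  # candidate cell input
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Tiny example: 4 inputs (lat, lon, SOG, COG), hidden size 1, all-zero weights.
zero = ([[0.0] * 4], [[0.0]], [0.0])
params = {name: zero for name in ("i", "f", "o", "g")}
h, c = lstm_step([0.0, 0.0, 0.0, 0.0], [0.0], [1.0], params)
print(h, c)
```

With all-zero weights every gate evaluates to 0.5, so the forget gate keeps half of the previous cell state, which makes the role of each gate easy to check by hand.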
The LSTM network sequentially processes the input vessel path sequence $X_l$ with a hidden vector, in which the memory cell uses the input vector at the current time step $t$ and the hidden state at the previous time step $t-1$ to update the hidden state $h_t$, expressed by [20]
(4)
where $\odot$ represents the element-wise product, $\sigma$ is the sigmoid function, and $\tanh$ is the hyperbolic tangent function. The subscripts $i$, $f$, and $o$ indicate the input gate, forget gate, and output gate, respectively. The cell input activation vector and the cell state are defined as follows:
(5)
Equations (4) and (5) are parametrized by the weight matrices and the bias terms. The weight matrix subscript indicates the input-output connection: $W_f$ is the hidden-to-forget-gate matrix, and $U_f$ is the input-to-forget-gate matrix.
We employ an encoder-decoder architecture to solve the prediction problem of mapping one data sequence to another, specifically defining the mapping function $F_{l,h}$. The encoder encodes the vessel's kinematic state sequence $X_l$ one state at a time into a hidden state sequence. The encoding function $E$ is represented by [20]
(6)
where $E$ is a neural network that maps the input sequence to an internal representation sequence of hidden states. Each hidden state $h_t \in \mathbb{R}^{2q}$ is produced by a bidirectional recurrent neural network (RNN) with a state of size $q$.
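For orientation, the standard LSTM update is written out below. This is the common textbook formulation, given here as an assumption; it follows the text's convention that $U$ matrices act on the input $x_t$ and $W$ matrices act on the previous hidden state $h_{t-1}$, but it is not necessarily identical to the authors' equations (4) and (5).

```latex
\begin{aligned}
i_t &= \sigma\left(U_i x_t + W_i h_{t-1} + b_i\right), \\
f_t &= \sigma\left(U_f x_t + W_f h_{t-1} + b_f\right), \\
o_t &= \sigma\left(U_o x_t + W_o h_{t-1} + b_o\right), \\
\tilde{c}_t &= \tanh\left(U_c x_t + W_c h_{t-1} + b_c\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh\left(c_t\right).
\end{aligned}
```

Here $\tilde{c}_t$ is the cell input activation vector and $c_t$ is the cell state; the symbols $\tilde{c}_t$ and the subscript $c$ on the weights are naming choices made for this sketch.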
The encoder layer computes the representation of the input sequence, from which an aggregation function creates the context representation. The decoder repeatedly uses this context representation to generate the output prediction. We use average pooling over time (AVG) to reduce the sequence to a single context vector as
(7)
which computes the mean value of each hidden unit. Each context feature $z_r$ is defined as
(8)
The symbol $\theta_D$ represents the parameterization of the autoregressive decoder function $D$, which predicts the future vessel path $\hat{y}_j$ at each step $j$ from the previous state $\hat{y}_{j-1}$ as follows [21]:
(9)
where $u_j$ denotes the RNN hidden state, $\psi$ is the planning descriptor, and $z_j$ is the context vector. Finally, the output prediction $\hat{Y}_h$ of length $h$ is given by
(10)
To evaluate the quality of the prediction, the authors employ the root-mean-square error (RMSE) [22][23], which measures the error between the predicted path $\hat{Y}_h$ and the actual path $Y_i$ and is defined as
(11)
LSTM-based systems were initially intended to replicate human decision-making by using machines to process large quantities of data. Advanced LSTM systems enable autonomous vessels, which can operate independently without human intervention and have a lower error rate than human-operated vessels. Deep learning is gradually altering the maritime industry's traditional operational processes, especially in mining maritime traffic from vessel data as discussed in this paper. This study employed the LSTM network to develop the path prediction model, implemented in Python 3.6. The training results are shown in Fig. 5; the model was trained for 50 epochs with a learning rate of 0.0005.
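The two simplest building blocks in this passage, average pooling over time and the RMSE metric, can be sketched as below. The function names are hypothetical, and the RMSE variant shown (square root of the mean squared Euclidean distance between predicted and actual points) is one common definition assumed here; the authors' equation (11) may be normalized differently.

```python
import math


def average_pool(hidden_states):
    """Mean of each hidden unit over time, giving a single context vector."""
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    steps = len(hidden_states)
    return [sum(unit_values) / steps for unit_values in zip(*hidden_states)]


def rmse(predicted, actual):
    """Root-mean-square error between two paths of equal length."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("paths must be non-empty and of equal length")
    squared = [
        sum((p - a) ** 2 for p, a in zip(p_point, a_point))
        for p_point, a_point in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Two hidden states with two units each, pooled into one context vector.
context = average_pool([[1.0, 2.0], [3.0, 4.0]])

# Made-up two-point paths: the first point is off by (3, 4), the second is exact.
error = rmse([(0.0, 0.0), (0.0, 0.0)], [(3.0, 4.0), (0.0, 0.0)])
print(context, error)
```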

Case study experimental results / Eksperimentalni rezultati studije slučaja
The optimal regression value attained in training is 0.001075, and the loss value reaches 0.0000039103. A set of historical vessel data from the AIS system is used as input data for training the prediction model to provide a safe path, as shown in Fig. 6. Fig. 6a illustrates the route taken by the vessel with MMSI code 636018224 while departing from the wharf and moving towards the sea (indicated by the light-red line). Similarly, the blue line in the figure shows the path followed by the vessel with MMSI code 574999621 while arriving from the sea at the wharf. The dataset for vessels of the same type, visualizing the maritime traffic conditions in Vung Tau port, is shown in Fig. 6b with Type-1 vessels (on the left side) and Type-3 vessels (on the right side). Finally, the case study experimental results are shown in detail in Fig. 7, where the predicted path of the vessel (red line) closely follows the historical path (blue line) of this vessel.

Evaluations / Procjene
In general, the system displays marine traffic data calculation and updating and shows adequate reaction times in its initial operation at Vungtau Port, Vietnam. Tests were carried out using the proposed model on the collected and distributed traffic data as well as on traffic information obtained from the AIS system. The results (Figure 6 and Figure 7) outline the management integration system, which, once implemented, will enhance the operational efficiency of the region's specialist maritime management. The system serves the common benefit of the community and supports national security and defense; in terms of financial efficiency, therefore, determining its effectiveness is difficult. However, from a socioeconomic standpoint, the initiative has the following consequences:
- Support for the maritime management system, including monitoring navigation in narrow channel locations, anchorage positions, berthing, and leaving the wharf;
- Support for maritime activity monitoring and management, tracking vessel position, direction of movement, and speed.
Figure 7 Predicted vessel path for arrival at VIETSOVPETRO wharfs in southern Vietnam Slika 7. Predviđena putanja plovila za dolazak na VIETSOVPETRO pristanište u južnom Vijetnamu
In the future, the model can develop new application features in ensuring maritime safety by predicting the possibility of collision, predicting the risk of running aground, determining the closest point of approach, monitoring cargo anchorage locations, monitoring and indicating current vessel status to reduce risks to vessel, property, and people, as well as environmental pollution hazards.The advanced and modern technologies in state management methods can be used to increase the attractiveness and competitiveness of the seaport system.In addition, it actively contributes to the gradual perfection of specialized management in the maritime sector through international conventions to which Vietnam is a signatory.To this end, the concerns highlighted in Remark 1 have been addressed and the issues thus resolved.

CONCLUSION / Zaključak
This paper reviews several algorithms, including Decision Tree, Naive Bayes, Random Forest, and LSTM, and selects LSTM as the most appropriate model. We present a new approach to predict vessel traffic conditions in navigational channels based on historical data from the AIS identification information system. We provide a framework for efficient collection, integration, and analysis of maritime traffic-related data to give an accurate and timely status estimation. In addition, the lack of data in some areas of the navigation channel is still one of the major challenges, and the proposed solution is a data mining method based on collected historical data. The recommended deep neural network algorithm can easily be integrated into a program that runs on a computer or web application to facilitate the mining of historical vessel data for marine traffic and is ready to be used through the application. In conclusion, synchronizing and improving maritime traffic is an issue that needs to be addressed, and this is a potential direction for future research.
Conflict of interest: The authors state that there is no conflict of interest.

Figure 1 Maritime traffic state prediction system Slika 1. Sustav za predviđanje stanja pomorskog prometa
Fig. 1 depicts the suggested structure for the marine traffic state prediction model, which is summarized as follows:
- Step 1 - Summarizing dynamic data: This step conducts data pre-processing and labeling following the FM-Traffic, i.e., labels are in the set of (tag block times, lon, lat, SOG, COG, heading). As shown at the beginning of this section, the traffic conditions, including FM-Traffic, are already available in the historical vessel data. Concretely, the FM-Traffic can be calculated directly from the velocity extracted from historical vessel data, or it can be the output of this data mining model (see Step 3);
- Step 2 - Proposing the MTSP model: As discussed, suitable mining data are labeled based on historical vessel data (i.e., the FM-Traffic data for the traffic conditions). The system is then tested experimentally by applying deep learning algorithms;
- Step 3 - The DL algorithm proposal: The maritime traffic state prediction model proposed in Step 2 is applied to analyze the actual data to determine the label (FM-Traffic), considering real-time data loss or discrete data loss.

Figure 6 Input data for the creation of a model to predict the path of a vessel in Vietnam's southern sea region. Slika 6. Ulazni podaci za izradu modela za predviđanje putanje plovila u južnom morskom području Vijetnama.