Encoding is the process of transforming a set of Unicode characters into a sequence of bytes. In contrast, decoding is the process of transforming a sequence of encoded bytes into a set of Unicode characters.
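As a small illustration of the definition above, the following sketch prints the bytes of a single character in two encodings. It assumes the KCVT function is available in your session; the variable names are made up for the example. The character é is stored as the single byte E9 in WLATIN1 and as the two bytes C3 A9 in UTF-8, so any trailing 20 values in the printed hex are only blank padding.

```sas
/* Sketch: show the bytes of one character in two encodings.       */
/* Assumption: the KCVT function is available in this SAS session. */
data _null_;
  length latin $1 utf8 $4;
  latin = 'E9'x;                            /* e-acute as one WLATIN1 byte */
  utf8  = kcvt(latin, 'wlatin1', 'utf-8');  /* transcode the byte string   */
  put latin= $hex2. utf8= $hex8.;           /* print both values as hex    */
run;
```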
To accommodate multinational trials and the need to handle non-English characters, EDC vendors may choose UTF-8 encoding for their data sets. However, when we use SAS on a Windows system, the default session encoding is usually WLATIN1.
In the Windows environment, if we try to read a data set encoded in UTF-8, we may get log messages such as the following:
NOTE: Data file xxxxx is in a format that is native to another host, or the file encoding does not match the session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce performance.
ERROR: Some character data was lost during transcoding in the dataset xxxxx. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.
There is also a discussion of this issue on the SAS website:
"ERROR: Some character data was lost during transcoding in the data set" occurs when the data set encoding does not match the SAS® session encoding
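Before choosing a fix, it can help to confirm that the two encodings really differ. The sketch below is one way to check, not the only one: it prints the session encoding and then the encoding recorded in a data set header. SASHELP.CLASS is used only as a stand-in; substitute the data set that triggers the error.

```sas
/* Session encoding of the running SAS session */
%put NOTE: Session encoding is &SYSENCODING;
proc options option=encoding;
run;

/* Encoding recorded in the data set header.                    */
/* sashelp.class is a placeholder for the data set in question. */
proc contents data=sashelp.class;
run;

/* The same attribute, retrieved programmatically */
%let dsid = %sysfunc(open(sashelp.class));
%put NOTE: Data set encoding is %sysfunc(attrc(&dsid, encoding));
%let rc = %sysfunc(close(&dsid));
```

If the two values differ, SAS transcodes the character data when it reads the data set, which is where the error above can arise.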
There are several ways to ensure that data is transcoded correctly from one encoding to another. The following three references provide very good explanations:
- Hui Song, The Impact of Change from wlatin1 to UTF-8 Encoding in SAS Environment
- Bari Lawhorn, Encoding: helping SAS speak your language
- SAS® 9.4 National Language Support (NLS) Reference Guide
According to the paper by Song, there are three ways to change the encoding:
1. Force the transcoding to WLATIN1 by using the ENCODING= data set option.
data x(encoding='WLATIN1');
set x;
run;
2. USE PROC DATASETS
The second approach is to use PROC DATASETS as below:
proc datasets lib=libname;
modify x/correctencoding='WLATIN1';
run;
quit;
However, this way is NOT recommended: it only changes the encoding indicator in the data set header and does not actually transcode the data itself!
3. USE PROC MIGRATE
When you would like to convert multiple SAS data sets from wlatin1 into UTF-8, you can use PROC MIGRATE.
proc migrate in=inlib out=outlib;
run;
This migrates all SAS data sets in libname inlib to libname outlib. It retains the data set labels as well. Note that inlib and outlib should be two different locations.
Also, we can use the following approaches:
1. Use the INENCODING= option in the LIBNAME statement.
libname in 'directory\' inencoding=asciiany;
data x;
set in.x;
run;
2. Specify the ENCODING= data set option directly on the input data set.
proc sort data=RAWDM.AE(encoding='wlatin1') out=OUTSTATS.AE;
by subject;
run;
Here are some approaches and examples for resolving data encoding issues, taken from the NLS reference guide:
Example 1: Creating a SAS Data Set with Mixed Encodings and with Transcoding Suppressed
By specifying the data set option ENCODING=ANY, you can create a SAS data set that contains mixed encodings, and suppress transcoding for either input or output processing.
In this example, the new data set MYFILES.MIXED contains some data that uses the Latin1 encoding, and some data that uses the Latin2 encoding. When the data set is processed, no transcoding occurs. For example, the correct Latin1 characters in a Latin1 session encoding and correct Latin2 characters in a Latin2 session encoding are displayed.
libname myfiles 'SAS data-library';
data myfiles.mixed (encoding=any);
set work.latin1;
set work.latin2;
run;
Example 2: Creating a SAS Data Set with a Particular Encoding
For output processing, you can override the current session encoding. This action might be necessary, for example, if the normal access to the file uses a different session encoding.
For example, if the current session encoding is Wlatin1, you can specify ENCODING=WLATIN2 in order to create the data set that uses the encoding Wlatin2. The following statements tell SAS to write the data to the new data set using the Wlatin2 encoding instead of the session encoding. The encoding is also specified in the descriptor portion of the file.
libname myfiles 'SAS data-library';
data myfiles.difencoding (encoding=wlatin2);
run;
Example 3: Using the FILE Statement to Specify an Encoding for Writing to an External File
This example creates an external file from a SAS data set. The current session encoding is Wlatin1, but the external file's encoding needs to be UTF-8. By default, SAS writes the external file using the current session encoding.
To specify what encoding to use for writing data to the external file, specify the ENCODING= option:
libname myfiles 'SAS data-library';
filename outfile 'external-file';
data _null_;
set myfiles.cars;
file outfile encoding="utf-8";
put Make Model Year;
run;
When you tell SAS that the external file is to be in UTF-8 encoding, SAS then transcodes the data from Wlatin1 to the specified UTF-8 encoding.
Example 4: Using the FILENAME Statement to Specify an Encoding for Reading an External File
This example creates a SAS data set from an external file. The external file is in UTF-8 character-set encoding, and the current SAS session is in the Wlatin1 encoding. By default, SAS assumes that an external file is in the same encoding as the session encoding, which causes the character data to be written to the new SAS data set incorrectly.
To specify which encoding to use when reading the external file, specify the ENCODING= option:
libname myfiles 'SAS data-library';
filename extfile 'external-file' encoding="utf-8";
data myfiles.unicode;
infile extfile;
input Make $ Model $ Year;
run;
When you specify that the external file is in UTF-8, SAS then transcodes the external file from UTF-8 to the current session encoding when writing to the new SAS data set. Therefore, the data is written to the new data set correctly in Wlatin1.
Example 5: Using the FILENAME Statement to Specify an Encoding for Writing to an External File
This example creates an external file from a SAS data set. By default, SAS writes the external file using the current session encoding. The current session encoding is Wlatin1, but the external file's encoding needs to be UTF-8.
To specify which encoding to use when writing data to the external file, specify the ENCODING= option:
libname myfiles 'SAS data-library';
filename outfile 'external-file' encoding="utf-8";
data _null_;
set myfiles.cars;
file outfile;
put Make Model Year;
run;
When you specify that the external file is to be in UTF-8 encoding, SAS then transcodes the data from Wlatin1 to the specified UTF-8 encoding when writing to the external file.
Example 6: Using the INFILE Statement to Specify an Encoding for Reading from an External File
This example creates a SAS data set from an external file. The external file's encoding is in UTF-8, and the current SAS session encoding is Wlatin1. By default, SAS assumes that the external file is in the same encoding as the session encoding, which causes the character data to be written to the new SAS data set incorrectly.
To specify which encoding to use when reading the external file, specify the ENCODING= option:
libname myfiles 'SAS data-library';
filename extfile 'external-file';
data myfiles.unicode;
infile extfile encoding="utf-8";
input Make $ Model $ Year;
run;
When you specify that the external file is in UTF-8, SAS then transcodes the external file from UTF-8 to the current session encoding when writing to the new SAS data set. Therefore, the data is written to the new data set correctly in Wlatin1.
Incorrect encoding can be stamped on a SAS 7 or SAS 8 data set if it is copied or replaced in a SAS 9 session with a different session encoding from the data. The incorrect encoding stamp can be corrected with the CORRECTENCODING= option in the MODIFY statement in PROC DATASETS. If a character variable contains binary data, transcoding might corrupt the data.