Dealing with encoding issues in clinical trial data: WLATIN1 and UTF-8

Posted by oliyiyi on 2016-11-14 10:02:31

Clinical trials today are often global and multinational. Data collection has also moved to electronic data capture (EDC), where the investigational sites enter the trial data directly, whether the sites are in English-speaking or non-English-speaking countries. One issue we often run into is data encoding.


Encoding is the process of transforming a set of Unicode characters into a sequence of bytes. In contrast, decoding is the process of transforming a sequence of encoded bytes into a set of Unicode characters.


To accommodate multinational trials and the need to handle non-English characters, EDC vendors may choose UTF-8 as the encoding for their data sets. However, a SAS session on Windows usually runs with the WLATIN1 session encoding (Windows code page 1252), so the data set encoding and the session encoding do not match.


In the Windows environment, if we try to read a data set encoded in UTF-8, we may get messages such as the following:


NOTE: Data file xxxxx is in a format that is native to another host, or the file encoding does not match the session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce performance.
ERROR: Some character data was lost during transcoding in the dataset xxxxx. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.
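Before choosing a fix, it helps to confirm which two encodings are actually involved. The following is a minimal sketch, assuming a libref RAWDM that is already assigned and contains a data set AE (names borrowed from the PROC SORT example later in this post; substitute your own):

```sas
/* Session encoding of the current SAS session */
%put NOTE: Session encoding is &sysencoding;

proc options option=encoding;
run;

/* Encoding recorded in the data set header:        */
/* look for the "Encoding" row in the PROC CONTENTS */
/* attribute listing                                */
proc contents data=rawdm.ae;
run;
```

If the two values differ (for example, wlatin1 for the session and utf-8 for the data set), SAS has to transcode the character data when it reads the data set, and that is where the messages above come from.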


The SAS website also discusses this issue in the following note:

      "ERROR: Some character data was lost during transcoding in the data set" occurs when the data set encoding does not match the SAS® session encoding

There are several ways to ensure that the data is transcoded correctly from one encoding to another. The following three papers provide very good explanations:



According to the paper by Song, there are three ways to change the encoding:
1.    Force the transcoding by specifying WLATIN1 as the target encoding with the data set option ENCODING=.

data x(encoding='WLATIN1');
   set x;
run;

2.    Use PROC DATASETS
The second approach is to use PROC DATASETS with the CORRECTENCODING= option:

proc datasets lib=libname;
   modify x / correctencoding='WLATIN1';
run;
quit;

However, this approach is NOT recommended for converting data: it only changes the encoding indicator stored in the data set header and does not actually transcode the data itself!


3.    Use PROC MIGRATE
When you would like to convert multiple SAS data sets from WLATIN1 into UTF-8, you can use PROC MIGRATE.

proc migrate in=inlib out=outlib;
run;

This migrates all SAS data sets in the library inlib to the library outlib and retains the data set labels as well. Note that inlib and outlib must be two different locations.



We can also use the following approaches:

1. Use the INENCODING= option in the LIBNAME statement.

libname in 'directory\' inencoding=asciiany;
data x;
   set in.x;
run;

2. Use the ENCODING= data set option directly on the input data set.

proc sort data=RAWDM.AE(encoding='wlatin1') out=OUTSTATS.AE;
   by subject;
run;

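When transcoding fails, it is useful to know which values contain the problem characters. The sketch below is one possible way to list them, not a method from the papers above. It assumes a WLATIN1 (single-byte) session and reuses the library IN and data set X from approach 1; the output data set NONASCII and the variables VARNAME and VALUE are hypothetical names introduced for illustration, and the length of 200 for VALUE is an arbitrary choice.

```sas
/* Read the bytes as stored, without transcoding (approach 1 above) */
libname in 'directory\' inencoding=asciiany;

data work.nonascii;
   set in.x;
   /* Declared before the LENGTH statement, so the array holds */
   /* only the character variables that come from in.x         */
   array cvars {*} _character_;
   length varname $32 value $200;     /* hypothetical helper variables */
   do i = 1 to dim(cvars);
      do j = 1 to length(cvars{i});
         /* Any byte above 127 lies outside 7-bit ASCII */
         if rank(char(cvars{i}, j)) > 127 then do;
            varname = vname(cvars{i});
            value   = cvars{i};
            output;
            leave;                    /* one row per flagged value */
         end;
      end;
   end;
   keep varname value;
run;

proc print data=work.nonascii;
run;
```

Each row of WORK.NONASCII names a variable and a value that holds at least one non-ASCII byte, which narrows down where to look before forcing a conversion. The step assumes IN.X has at least one character variable; otherwise the ARRAY statement fails.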
The following examples for resolving data encoding issues come from the SAS National Language Support (NLS) reference guide:

Example 1: Creating a SAS Data Set with Mixed Encodings and with Transcoding Suppressed

By specifying the data set option ENCODING=ANY, you can create a SAS data set that contains mixed encodings, and suppress transcoding for either input or output processing.

In this example, the new data set MYFILES.MIXED contains some data that uses the Latin1 encoding, and some data that uses the Latin2 encoding. When the data set is processed, no transcoding occurs. For example, the correct Latin1 characters in a Latin1 session encoding and correct Latin2 characters in a Latin2 session encoding are displayed.

libname myfiles 'SAS data-library';
data myfiles.mixed (encoding=any);
set work.latin1;
set work.latin2;
run;

Example 2: Creating a SAS Data Set with a Particular Encoding

For output processing, you can override the current session encoding. This action might be necessary, for example, if the normal access to the file uses a different session encoding.

For example, if the current session encoding is Wlatin1, you can specify ENCODING=WLATIN2 in order to create the data set that uses the encoding Wlatin2. The following statements tell SAS to write the data to the new data set using the Wlatin2 encoding instead of the session encoding. The encoding is also specified in the descriptor portion of the file.

libname myfiles 'SAS data-library';
data myfiles.difencoding (encoding=wlatin2);
run;

Example 3: Using the FILE Statement to Specify an Encoding for Writing to an External File

This example creates an external file from a SAS data set. The current session encoding is Wlatin1, but the external file's encoding needs to be UTF-8. By default, SAS writes the external file using the current session encoding.
To specify what encoding to use for writing data to the external file, specify the ENCODING= option:
libname myfiles 'SAS data-library';
filename outfile 'external-file';
data _null_;
set myfiles.cars;
file outfile encoding="utf-8";
put Make Model Year;
run;
When you tell SAS that the external file is to be in UTF-8 encoding, SAS then transcodes the data from Wlatin1 to the specified UTF-8 encoding.


Example 4: Using the FILENAME Statement to Specify an Encoding for Reading an External File

This example creates a SAS data set from an external file. The external file is in UTF-8 character-set encoding, and the current SAS session is in the Wlatin1 encoding. By default, SAS assumes that an external file is in the same encoding as the session encoding, which causes the character data to be written to the new SAS data set incorrectly.

To specify which encoding to use when reading the external file, specify the ENCODING= option:
libname myfiles 'SAS data-library';
filename extfile 'external-file' encoding="utf-8";
data myfiles.unicode;
infile extfile;
input Make $ Model $ Year;
run;
When you specify that the external file is in UTF-8, SAS then transcodes the external file from UTF-8 to the current session encoding when writing to the new SAS data set. Therefore, the data is written to the new data set correctly in Wlatin1.


Example 5: Using the FILENAME Statement to Specify an Encoding for Writing to an External File

This example creates an external file from a SAS data set. By default, SAS writes the external file using the current session encoding. The current session encoding is Wlatin1, but the external file's encoding needs to be UTF-8.

To specify which encoding to use when writing data to the external file, specify the ENCODING= option:
libname myfiles 'SAS data-library';
filename outfile 'external-file' encoding="utf-8";
data _null_;
set myfiles.cars;
file outfile;
put Make Model Year;
run;
When you specify that the external file is to be in UTF-8 encoding, SAS then transcodes the data from Wlatin1 to the specified UTF-8 encoding when writing to the external file.

Example 6: Using the INFILE Statement to Specify an Encoding for Reading from an External File

This example creates a SAS data set from an external file. The external file's encoding is in UTF-8, and the current SAS session encoding is Wlatin1. By default, SAS assumes that the external file is in the same encoding as the session encoding, which causes the character data to be written to the new SAS data set incorrectly.

To specify which encoding to use when reading the external file, specify the ENCODING= option:
libname myfiles 'SAS data-library';
filename extfile 'external-file';
data myfiles.unicode;
infile extfile encoding="utf-8";
input Make $ Model $ Year;
run;
When you specify that the external file is in UTF-8, SAS then transcodes the external file from UTF-8 to the current session encoding when writing to the new SAS data set. Therefore, the data is written to the new data set correctly in Wlatin1.
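Examples 3 and 6 can be combined into a small round-trip check. This is a minimal sketch, assuming a WLATIN1 session; the fileref TMP, the data set WORK.ROUNDTRIP, and the variables SITE and SAME are hypothetical names, and the test value is built from the hex literal 'E9'x (e-acute in WLATIN1) so that it does not depend on how the program file itself is saved.

```sas
filename tmp temp;                          /* scratch external file */

/* Write: SAS transcodes from the session encoding to UTF-8 */
data _null_;
   file tmp encoding="utf-8";
   length site $20;
   site = cats('Caf', 'E9'x);               /* 'E9'x is e-acute in WLATIN1 */
   put site;
run;

/* Read back: SAS transcodes from UTF-8 to the session encoding */
data work.roundtrip;
   infile tmp encoding="utf-8" truncover;
   input site $20.;
   same = (site = cats('Caf', 'E9'x));      /* 1 if the value matches the original */
run;

proc print data=work.roundtrip;
run;
```

If the ENCODING= option is dropped from the INFILE statement only, the UTF-8 bytes are read as if they were WLATIN1, and the value should no longer match the original.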



Incorrect encoding can be stamped on a SAS 7 or SAS 8 data set if it is copied or replaced in a SAS 9 session with a different session encoding from the data. The incorrect encoding stamp can be corrected with the CORRECTENCODING= option in the MODIFY statement in PROC DATASETS. If a character variable contains binary data, transcoding might corrupt the data.
Reply by 水调歌头 (employment verified), posted on 2016-11-15 09:53:55
